🕵🏻 Anomaly Detection in 📈 Multivariate ⏳ Time-Series
- Resolved Data Anomalies/Outliers in 3 Multivariate Time-Series Data Sets related to the Health Care, Tourism, and Transportation Sectors.
- Implemented ANN, VAR, Isolation Forest, K-Means Clustering, PCA, and SVM, and compared their accuracy to find the best model.
- Built an efficient Autoencoder Model with an accuracy of 94.96%.
- Applied Neural Networks and Deep Learning Methods for detecting anomalies.
1️⃣ Problem Statement 1: ECG 🏥¶
The objective of detecting anomalies in ECG signals is to find irregular heart rates, heartbeats, and rhythms. To achieve this goal, an anomaly detection system must be able to find them across all heartbeat sequences and thereby obtain the essential metrics.
An electrocardiogram (ECG) is a simple test that can be used to check your heart's rhythm and electrical activity. Sensors attached to the skin are used to detect the electrical signals produced by your heart each time it beats.
Autoencoders
The autoencoder is an important application of neural networks and deep learning. It is widely used in dimensionality reduction, image compression, image denoising, and feature extraction, and it has also delivered strong results in anomaly detection.
This notebook uses autoencoders for detecting anomalies in ECG (electrocardiogram) readings, one of the best practical applications of autoencoders.
import tensorflow as tf
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from tensorflow.keras import layers, losses
from sklearn.model_selection import train_test_split
from tensorflow.keras.models import Model
import plotly.graph_objects as go
import warnings
warnings.filterwarnings('ignore')
def show_traces(df: pd.DataFrame):
    fig = go.Figure()
    for index, row in df.iterrows():
        fig.add_trace(go.Scatter(
            x=df.columns,    # column names are assumed to represent time points
            y=row.values,    # values of the current row
            mode='lines',    # draw as lines
            name=str(index)  # use the row index as the legend name
        ))
    fig.show()
df = pd.read_csv('http://storage.googleapis.com/download.tensorflow.org/data/ecg.csv', header=None)
print(df.shape)
show_traces(df.iloc[0:10,:])
(4998, 141)
👉🏻 Note: The dataset has 141 columns: the first 140 represent the ECG readings, and the last is a label column encoded as 0 or 1 showing whether the ECG is abnormal or normal.
# Separate the data and labels to make the code easier to follow
data = df.iloc[:,:-1].values
labels = df.iloc[:,-1].values
labels
array([1., 1., 1., ..., 0., 0., 0.])
train_data, test_data, train_labels, test_labels = train_test_split(data, labels, test_size = 0.2, random_state = 21)
Normalizing the data to the range [0-1]
# Now let's normalize the data.
# First, compute the maximum and minimum values from the training set
# (named min_val/max_val to avoid shadowing Python's built-in min/max)
min_val = tf.reduce_min(train_data)
max_val = tf.reduce_max(train_data)
# Apply the formula (data - min)/(max - min)
train_data = (train_data - min_val)/(max_val - min_val)
test_data = (test_data - min_val)/(max_val - min_val)
# Convert the data to float32
train_data = tf.cast(train_data, dtype=tf.float32)
test_data = tf.cast(test_data, dtype=tf.float32)
# The labels are either 0 or 1, so convert them to boolean (True or False)
train_labels = train_labels.astype(bool)
test_labels = test_labels.astype(bool)
#Now let's separate the data for normal ECG from that of abnormal ones
#Normal ECG data
n_train_data = train_data[train_labels]
n_test_data = test_data[test_labels]
#Abnormal ECG data
an_train_data = train_data[~train_labels]
an_test_data = test_data[~test_labels]
print(n_train_data)
tf.Tensor(
[[0.57030463 0.46561658 0.29058117 ... 0.48504233 0.4233502  0.47598344]
 [0.48320588 0.28246963 0.16471253 ... 0.567567   0.4677294  0.2692329 ]
 [0.48144642 0.35151404 0.25972766 ... 0.5479421  0.5077544  0.54298663]
 ...
 [0.41039047 0.24164985 0.13120876 ... 0.5277313  0.5654091  0.5023885 ]
 [0.5397748  0.4140786  0.28101394 ... 0.51266515 0.43706053 0.4426865 ]
 [0.29639772 0.15988176 0.18883787 ... 0.53766966 0.545786   0.40826708]], shape=(2359, 140), dtype=float32)
# Let's plot a normal ECG
plt.plot(np.arange(140), n_train_data[0])
plt.grid()
plt.title('Normal ECG')
plt.show()
# Let's plot one abnormal ECG
plt.plot(np.arange(140), an_train_data[0])
plt.grid()
plt.title('Abnormal ECG')
plt.show()
1.1 Autoencoders¶
Autoencoders are a specific type of feedforward neural network. An autoencoder consists of two parts:
1. Encoder
2. Decoder
1.2 Analysis 📝¶
❓ How will the model detect anomalies?

We will create an encoder and a decoder using an ANN architecture. We provide the ECG data as input and the model tries to reconstruct it. The error between the original data and the reconstructed output is called the reconstruction error, and based on this reconstruction error we classify an ECG as anomalous or not.

To do this, we train the model only on the normal ECG data but test it on the full test set. When an abnormal ECG is given as input, the autoencoder will try to reconstruct it, but since it has only been trained on normal ECG data, the output will have a larger reconstruction error. We also define a threshold for the error: if the reconstruction error is above the threshold, the ECG is categorised as anomalous.
For example, given an image of a handwritten digit, an autoencoder first encodes the image into a lower dimensional representation, then decodes it back to an image. It learns to compress the data while minimizing the reconstruction error.
👉🏻 In simple words, AutoEncoder is an unsupervised Artificial Neural Network that attempts to encode the data by compressing it into the lower dimensions (bottleneck layer or code) and then decoding the data to reconstruct the original input. The bottleneck layer (or code) holds the compressed representation of the input data.
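The classification rule described above — flag an ECG when its reconstruction error exceeds a threshold — can be sketched with synthetic error values (a minimal NumPy illustration; the numbers are made up, not from the ECG model):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-sample reconstruction errors (MAE): normal ECGs reconstruct
# well (low error), anomalous ones poorly (high error)
normal_err = rng.normal(loc=0.02, scale=0.005, size=200)
anomalous_err = rng.normal(loc=0.08, scale=0.01, size=20)

# Threshold = mean + one standard deviation of the errors on normal data,
# the same rule used later in this notebook
threshold = normal_err.mean() + normal_err.std()

# Classify: an error above the threshold means the sample is anomalous
flagged = anomalous_err > threshold
print("threshold:", round(float(threshold), 4))
print("anomalies flagged:", int(flagged.sum()), "of", len(flagged))
```

Because the anomalous errors here are drawn well above the normal ones, all of them land past the threshold; on real data the two distributions overlap and the threshold trades off false positives against false negatives.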
#Now let's define the model!
#Here I have used the Model Subclassing API (but we can also use the Sequential API)
#The model has 2 parts : 1. Encoder and 2. Decoder
class detector(Model):
    def __init__(self):
        super(detector, self).__init__()
        self.encoder = tf.keras.Sequential([
            layers.Dense(32, activation='relu'),
            layers.Dense(16, activation='relu'),
            layers.Dense(8, activation='relu')
        ])
        self.decoder = tf.keras.Sequential([
            layers.Dense(16, activation='relu'),
            layers.Dense(32, activation='relu'),
            layers.Dense(140, activation='sigmoid')
        ])

    def call(self, x):
        encoded = self.encoder(x)
        decoded = self.decoder(encoded)
        return decoded
👉🏻 NOTE: Notice that in fit() both arguments are the same, i.e. n_train_data; the reason is that we compare the original ECG with the reconstructed one to calculate the reconstruction error. Autoencoders are unsupervised learning models, but here we train them with a supervised method, so it is more accurate to say they are used in a self-supervised way.
#Let's compile and train the model!!
autoencoder = detector()
autoencoder.compile(optimizer='adam', loss='mae')
autoencoder.fit(n_train_data, n_train_data, epochs = 20, batch_size=512, validation_data=(n_test_data, n_test_data))
Epoch 1/20
5/5 ━━━━━━━━━━━━━━━━━━━━ 1s 31ms/step - loss: 0.0584 - val_loss: 0.0562
Epoch 2/20
5/5 ━━━━━━━━━━━━━━━━━━━━ 0s 8ms/step - loss: 0.0557 - val_loss: 0.0532
Epoch 3/20
5/5 ━━━━━━━━━━━━━━━━━━━━ 0s 8ms/step - loss: 0.0523 - val_loss: 0.0489
Epoch 4/20
5/5 ━━━━━━━━━━━━━━━━━━━━ 0s 7ms/step - loss: 0.0481 - val_loss: 0.0447
Epoch 5/20
5/5 ━━━━━━━━━━━━━━━━━━━━ 0s 7ms/step - loss: 0.0438 - val_loss: 0.0404
Epoch 6/20
5/5 ━━━━━━━━━━━━━━━━━━━━ 0s 7ms/step - loss: 0.0397 - val_loss: 0.0366
Epoch 7/20
5/5 ━━━━━━━━━━━━━━━━━━━━ 0s 7ms/step - loss: 0.0364 - val_loss: 0.0337
Epoch 8/20
5/5 ━━━━━━━━━━━━━━━━━━━━ 0s 7ms/step - loss: 0.0335 - val_loss: 0.0314
Epoch 9/20
5/5 ━━━━━━━━━━━━━━━━━━━━ 0s 7ms/step - loss: 0.0312 - val_loss: 0.0298
Epoch 10/20
5/5 ━━━━━━━━━━━━━━━━━━━━ 0s 7ms/step - loss: 0.0297 - val_loss: 0.0285
Epoch 11/20
5/5 ━━━━━━━━━━━━━━━━━━━━ 0s 7ms/step - loss: 0.0284 - val_loss: 0.0273
Epoch 12/20
5/5 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step - loss: 0.0272 - val_loss: 0.0262
Epoch 13/20
5/5 ━━━━━━━━━━━━━━━━━━━━ 0s 7ms/step - loss: 0.0262 - val_loss: 0.0252
Epoch 14/20
5/5 ━━━━━━━━━━━━━━━━━━━━ 0s 7ms/step - loss: 0.0253 - val_loss: 0.0243
Epoch 15/20
5/5 ━━━━━━━━━━━━━━━━━━━━ 0s 7ms/step - loss: 0.0243 - val_loss: 0.0235
Epoch 16/20
5/5 ━━━━━━━━━━━━━━━━━━━━ 0s 7ms/step - loss: 0.0236 - val_loss: 0.0228
Epoch 17/20
5/5 ━━━━━━━━━━━━━━━━━━━━ 0s 7ms/step - loss: 0.0230 - val_loss: 0.0221
Epoch 18/20
5/5 ━━━━━━━━━━━━━━━━━━━━ 0s 7ms/step - loss: 0.0223 - val_loss: 0.0215
Epoch 19/20
5/5 ━━━━━━━━━━━━━━━━━━━━ 0s 7ms/step - loss: 0.0214 - val_loss: 0.0209
Epoch 20/20
5/5 ━━━━━━━━━━━━━━━━━━━━ 0s 7ms/step - loss: 0.0211 - val_loss: 0.0205
<keras.src.callbacks.history.History at 0x1d564d21d50>
# Now let's define a function to plot the original ECG against its reconstruction and show the error
def plot(data, n):
    enc_img = autoencoder.encoder(data)
    dec_img = autoencoder.decoder(enc_img)
    plt.plot(data[n], 'b')
    plt.plot(dec_img[n], 'r')
    plt.fill_between(np.arange(140), data[n], dec_img[n], color='lightcoral')
    plt.legend(labels=['Input', 'Reconstruction', 'Error'])
    plt.show()
plot(n_test_data, 0)
plot(an_test_data, 0)
👉🏻 As I mentioned earlier, an ECG is flagged as anomalous if its reconstruction error is greater than a threshold. We can set the threshold in any way we want; here I set it to one standard deviation above the mean reconstruction error on the normal training data.
reconstructed = autoencoder(n_train_data)
train_loss = losses.mae(reconstructed, n_train_data)
t = np.mean(train_loss) + np.std(train_loss)
def prediction(model, data, threshold):
    rec = model(data)
    loss = losses.mae(rec, data)
    # True means the reconstruction error is below the threshold, i.e. the ECG is normal
    return tf.math.less(loss, threshold)
print(t)
0.033098478
pred = prediction(autoencoder, n_test_data, t)
print(pred)
tf.Tensor( [False True True False True True True False True True True True True True True True True True True True False True True True True True True True True True True True True True False True True True True False True True True True True True True True True True True True False True True True True True True True True True True True False True True True True True True True True True True True True True True False True True True True False True True True True True True True True True True True True True True True True True False True True True True True True True True True True True True True True True True True True False True True True True True True True True True True True True True True True True True False True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True False True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True False False True True True True True True True True False True True True True True True True True True True True True True True True True True False False True True True True True True True True True True True True True True False True True True True True True True True True True True True True True False True True True True True True True True True True True True True True True True True True True True False True True True True True True True True True True True True False True True True True True True False True True True True True True True True True True True True True True True True True True True False True True False True True True True True True True True True True True True True True True True True True True True True True True True True True True False True False True True True True True True True True True True True True False True True False True True True True True True True True True True True 
True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True True False True True True True True True True True True True True True True True False False True True True False True True True True True True True True True True False False False True False False True False True False False True True True True True True True True True True True True True True True True True True True True True True True True True False True True True True True True True True True True True True True True True True True True True True True False True False True True True False False False True False True True True True True True True True True True True True True True True True True True True True True True True True True False True True True False True True], shape=(560,), dtype=bool)
# Let's see some more results visually!
plot(n_test_data, 0)
plot(n_test_data, 1)
plot(n_test_data, 3)
This was just a basic model; it can be improved by hyperparameter tuning and by building the encoder and decoder as deeper networks. The threshold was determined using a very simple method and can also be changed to get better, more accurate results; the criterion used for determining the threshold can make a lot of difference.
1.3 Accuracy 🎯¶
Of the 2359 normal training samples, 2240 were predicted correctly, which gives an accuracy of ≈ 94.96%.
threshold = np.mean(train_loss) + 2*np.std(train_loss)
preds = tf.math.less(train_loss, threshold)
tf.math.count_nonzero(preds)
<tf.Tensor: shape=(), dtype=int64, numpy=2242>
preds.shape
TensorShape([2359])
acc = 2240/2359*100
print("Accuracy = ", acc, "%")
Accuracy = 94.95548961424333 %
🕵🏻 Anomaly Detection: Expedia Hotel 🏨
2️⃣ Problem Statement 2: Expedia Hotel 🏨¶
The dataset comes from the tourism sector and is a multivariate time series. The main objective here is to examine hotel prices from the Expedia hotel search data set. I have implemented different models to detect price spikes and dips.
import seaborn as sns
import matplotlib.dates as md
import matplotlib.pyplot as plt
from mpl_toolkits.axes_grid1 import host_subplot
import mpl_toolkits.axisartist as AA
import pandas as pd
plt.style.use(['default'])
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.covariance import EllipticEnvelope
from sklearn.ensemble import IsolationForest
from sklearn.svm import OneClassSVM
from mpl_toolkits.mplot3d import Axes3D
# !pip install pyemma
# from pyemma import msm
# %matplotlib inline
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
train = pd.read_csv('data/expedia-hotel/train.csv')
print(train.shape)
train.head()
(9917530, 54)
| srch_id | date_time | site_id | visitor_location_country_id | visitor_hist_starrating | visitor_hist_adr_usd | prop_country_id | prop_id | prop_starrating | prop_review_score | ... | comp6_rate_percent_diff | comp7_rate | comp7_inv | comp7_rate_percent_diff | comp8_rate | comp8_inv | comp8_rate_percent_diff | click_bool | gross_bookings_usd | booking_bool | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2013-04-04 08:32:15 | 12 | 187 | NaN | NaN | 219 | 893 | 3 | 3.5 | ... | NaN | NaN | NaN | NaN | 0.0 | 0.0 | NaN | 0 | NaN | 0 |
| 1 | 1 | 2013-04-04 08:32:15 | 12 | 187 | NaN | NaN | 219 | 10404 | 4 | 4.0 | ... | NaN | NaN | NaN | NaN | 0.0 | 0.0 | NaN | 0 | NaN | 0 |
| 2 | 1 | 2013-04-04 08:32:15 | 12 | 187 | NaN | NaN | 219 | 21315 | 3 | 4.5 | ... | NaN | NaN | NaN | NaN | 0.0 | 0.0 | NaN | 0 | NaN | 0 |
| 3 | 1 | 2013-04-04 08:32:15 | 12 | 187 | NaN | NaN | 219 | 27348 | 2 | 4.0 | ... | NaN | NaN | NaN | NaN | -1.0 | 0.0 | 5.0 | 0 | NaN | 0 |
| 4 | 1 | 2013-04-04 08:32:15 | 12 | 187 | NaN | NaN | 219 | 29604 | 4 | 3.5 | ... | NaN | NaN | NaN | NaN | 0.0 | 0.0 | NaN | 0 | NaN | 0 |
5 rows × 54 columns
train.isnull().sum()
srch_id                            0
date_time                          0
site_id                            0
visitor_location_country_id        0
visitor_hist_starrating      9412233
visitor_hist_adr_usd         9409918
prop_country_id                    0
prop_id                            0
prop_starrating                    0
prop_review_score              14630
prop_brand_bool                    0
prop_location_score1               0
prop_location_score2         2178380
prop_log_historical_price          0
position                           0
price_usd                          0
promotion_flag                     0
srch_destination_id                0
srch_length_of_stay                0
srch_booking_window                0
srch_adults_count                  0
srch_children_count                0
srch_room_count                    0
srch_saturday_night_bool           0
srch_query_affinity_score    9281966
orig_destination_distance    3216461
random_bool                        0
comp1_rate                   9681724
comp1_inv                    9663097
comp1_rate_percent_diff      9732623
comp2_rate                   5876897
comp2_inv                    5665992
comp2_rate_percent_diff      8807683
comp3_rate                   6858257
comp3_inv                    6625309
comp3_rate_percent_diff      8973523
comp4_rate                   9297431
comp4_inv                    9225059
comp4_rate_percent_diff      9653317
comp5_rate                   5473236
comp5_inv                    5196697
comp5_rate_percent_diff      8236524
comp6_rate                   9435043
comp6_inv                    9393385
comp6_rate_percent_diff      9724218
comp7_rate                   9286453
comp7_inv                    9204355
comp7_rate_percent_diff      9639692
comp8_rate                   6098487
comp8_inv                    5957142
comp8_rate_percent_diff      8691823
click_bool                         0
gross_bookings_usd           9640938
booking_bool                       0
dtype: int64
Data Set : Expedia Hotel
“Hotel” refers to hotels, apartments, B&Bs, hostels and other properties appearing on Expedia’s websites. Room types are not distinguished and the data can be assumed to apply to the least expensive room type. Most of the data are for searches that resulted in a purchase, but a small proportion are for searches not leading to a purchase. So, the main objective is to check the prices of Hotel Rooms.
Selecting property / visitor location country / srch_room_count with the most Data Points
# Count rows per prop_id to find the property with the most data points
train['prop_id'].value_counts()
prop_id
104517 4733
124342 4707
68420 4580
134154 4550
40279 4535
...
115966 1
130054 1
131174 1
125775 1
102926 1
Name: count, Length: 136886, dtype: int64
train['visitor_location_country_id'].value_counts()
visitor_location_country_id
219 5778805
100 990487
55 580072
216 434568
220 350433
...
144 31
198 30
146 30
24 29
8 27
Name: count, Length: 218, dtype: int64
# Num of rooms specified in search by customer
train['srch_room_count'].value_counts()
srch_room_count
1    9045780
2     734315
3      92372
4      25023
5       8053
8       4950
6       4345
7       2692
Name: count, dtype: int64
# Subset df
df = train.loc[train['prop_id'] == 104517]
df = df.loc[df['visitor_location_country_id'] == 219]
df = df.loc[df['srch_room_count'] == 1]
# srch_saturday = if stay includes Sat night
# srch_booking_window = num of days between search date and hotel stay start date
df = df[['date_time', 'price_usd', 'srch_booking_window', 'srch_saturday_night_bool']]
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 3049 entries, 2041 to 9917395
Data columns (total 4 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   date_time                 3049 non-null   object
 1   price_usd                 3049 non-null   float64
 2   srch_booking_window       3049 non-null   int64
 3   srch_saturday_night_bool  3049 non-null   int64
dtypes: float64(1), int64(2), object(1)
memory usage: 119.1+ KB
df.describe()
| price_usd | srch_booking_window | srch_saturday_night_bool | |
|---|---|---|---|
| count | 3049.000000 | 3049.000000 | 3049.000000 |
| mean | 112.939023 | 37.082650 | 0.524434 |
| std | 113.374049 | 46.160272 | 0.499485 |
| min | 0.120000 | 0.000000 | 0.000000 |
| 25% | 67.000000 | 6.000000 | 0.000000 |
| 50% | 100.000000 | 20.000000 | 1.000000 |
| 75% | 141.000000 | 48.000000 | 1.000000 |
| max | 5584.000000 | 292.000000 | 1.000000 |
train.loc[(train['price_usd'] == 5584) &
(train['visitor_location_country_id'] == 219)]
| srch_id | date_time | site_id | visitor_location_country_id | visitor_hist_starrating | visitor_hist_adr_usd | prop_country_id | prop_id | prop_starrating | prop_review_score | ... | comp6_rate_percent_diff | comp7_rate | comp7_inv | comp7_rate_percent_diff | comp8_rate | comp8_inv | comp8_rate_percent_diff | click_bool | gross_bookings_usd | booking_bool | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2905344 | 195154 | 2013-04-07 20:59:07 | 5 | 219 | NaN | NaN | 219 | 104517 | 4 | 4.0 | ... | NaN | NaN | NaN | NaN | -1.0 | 0.0 | 28.0 | 0 | NaN | 0 |
1 rows × 54 columns
👉🏻 Possible wrong search, no intention to book
# Removing 5584
df = df.loc[df['price_usd'] < 5584]
df['price_usd'].describe()
count    3048.000000
mean      111.144055
std        55.055161
min         0.120000
25%        67.000000
50%       100.000000
75%       141.000000
max       536.000000
Name: price_usd, dtype: float64
df['date_time'].min(), df['date_time'].max()
('2012-11-01 02:48:30', '2013-06-30 22:50:21')
df['date_time']
2041 2013-03-14 11:27:28
3152 2013-01-03 20:48:24
5166 2013-01-19 16:51:27
10135 2013-01-26 11:34:23
10560 2013-04-29 09:39:50
...
9908584 2013-01-02 19:36:56
9911528 2013-02-23 12:27:10
9915868 2012-11-17 17:53:21
9916580 2013-04-26 18:52:31
9917395 2013-02-23 11:03:49
Name: date_time, Length: 3048, dtype: object
pd.to_datetime(df['date_time'])
2041 2013-03-14 11:27:28
3152 2013-01-03 20:48:24
5166 2013-01-19 16:51:27
10135 2013-01-26 11:34:23
10560 2013-04-29 09:39:50
...
9908584 2013-01-02 19:36:56
9911528 2013-02-23 12:27:10
9915868 2012-11-17 17:53:21
9916580 2013-04-26 18:52:31
9917395 2013-02-23 11:03:49
Name: date_time, Length: 3048, dtype: datetime64[ns]
df['date_time'].describe()
df['date_time'] = pd.to_datetime(df['date_time'])
df.head()
| date_time | price_usd | srch_booking_window | srch_saturday_night_bool | |
|---|---|---|---|---|
| 2041 | 2013-03-14 11:27:28 | 206.0 | 99 | 1 |
| 3152 | 2013-01-03 20:48:24 | 186.0 | 6 | 0 |
| 5166 | 2013-01-19 16:51:27 | 61.0 | 1 | 0 |
| 10135 | 2013-01-26 11:34:23 | 72.0 | 116 | 0 |
| 10560 | 2013-04-29 09:39:50 | 246.0 | 245 | 0 |
import matplotlib.pyplot as plt
import plotly.express as px
fig = px.scatter(df, x=df.columns[0], y=df.columns[1], height=500, width=1200, template='plotly_white')
fig.update_layout(
title="Time series of room price by date of search",
yaxis_title="USD",
xaxis_title="Dates",
)
df.head()
| date_time | price_usd | srch_booking_window | srch_saturday_night_bool | |
|---|---|---|---|---|
| 2041 | 2013-03-14 11:27:28 | 206.0 | 99 | 1 |
| 3152 | 2013-01-03 20:48:24 | 186.0 | 6 | 0 |
| 5166 | 2013-01-19 16:51:27 | 61.0 | 1 | 0 |
| 10135 | 2013-01-26 11:34:23 | 72.0 | 116 | 0 |
| 10560 | 2013-04-29 09:39:50 | 246.0 | 245 | 0 |
a = df.loc[df['srch_saturday_night_bool'] == 0, 'price_usd']
b = df.loc[df['srch_saturday_night_bool'] == 1, 'price_usd']
plt.figure(figsize = (16, 8))
plt.hist(a, bins = 80,
alpha = 0.7,
label = 'search w/o Sat night stay')
plt.hist(b, bins = 80,
alpha = 0.7,
label = 'search w/ Sat night stay')
plt.xlabel('Price')
plt.ylabel('Freq')
plt.legend()
plt.title('Sat night search')
plt.plot();
df['srch_saturday_night_bool'].value_counts()
srch_saturday_night_bool
1    1599
0    1449
Name: count, dtype: int64
print('Kurtosis: %f' % df['price_usd'].kurt())
print('Skewness: %f' % df['price_usd'].skew())
Kurtosis: 3.617559
Skewness: 1.443641
plt.style.use(['default'])
# sns.distplot is deprecated in recent seaborn; kdeplot draws the same density curves
sns.kdeplot(df['price_usd'], label = 'USD')
sns.kdeplot(df['srch_booking_window'], label = 'booking window')
plt.xlabel('dist')
plt.legend()
sns.despine()
sns.kdeplot(a)
sns.kdeplot(b)
sns.despine()
import numpy as np
df = df.sort_values('date_time')
df['date_time_int'] = df.date_time.astype('int64')
df['date_time_int']
3945840 1351738110000000000
63387 1351739203000000000
3352426 1351760658000000000
5257418 1351761063000000000
7091061 1351764925000000000
...
2792991 1372575836000000000
9363497 1372606952000000000
143486 1372610620000000000
2478763 1372622832000000000
3800163 1372632621000000000
Name: date_time_int, Length: 3048, dtype: int64
Cluster-based models
- K-Means
- Isolation Forest
- Clustering

Potential outlier features
- price_usd
- srch_booking_window (days between search date and first stay date)
- srch_saturday_night_bool (stay includes a Saturday night)
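Isolation Forest is imported above but not exercised in this section; a minimal sketch on synthetic data shaped like the three candidate features (the values and the contamination rate are illustrative assumptions, not fitted to the Expedia data):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)

# Synthetic stand-ins for price_usd, srch_booking_window, srch_saturday_night_bool
n = 500
toy = np.column_stack([
    rng.normal(110, 40, n),     # prices around the observed mean
    rng.integers(0, 120, n),    # booking window in days
    rng.integers(0, 2, n),      # Saturday-night flag
]).astype(float)
toy[:5, 0] = [2000, 1800, 2500, 3000, 1500]  # inject a few extreme prices

# Fit and predict: -1 marks anomalies, 1 marks normal points
iso = IsolationForest(contamination=0.01, random_state=42)
pred = iso.fit_predict(toy)
print("anomalous rows:", np.where(pred == -1)[0])
```

The `contamination` parameter fixes the expected share of anomalies up front, which plays the same role as the thresholds used elsewhere in this notebook.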
2.1 K-Means¶
This method looks at the data points in a dataset and groups those that are similar into a predefined number K of clusters. A threshold value can be added to detect anomalies: if the distance between a data point and its nearest centroid is greater than the threshold value, then it is an anomaly.
# Determine optimal cluster num using elbow method
data = df[['price_usd', 'srch_booking_window', 'srch_saturday_night_bool']]
n_cluster = range(1, 20)
kmeans = [KMeans(n_clusters = i).fit(data) for i in n_cluster]
scores = [kmeans[i].score(data) for i in range(len(kmeans))]
# elbow curve
fig, ax = plt.subplots(figsize = (16, 8))
ax.plot(n_cluster, scores, color = 'orange')
plt.xlabel('clusters num')
plt.ylabel('score')
plt.title('elbow curve')
plt.show();
👉🏻 Setting n_clusters to 7: beyond 7 clusters, additional clusters do not explain much more of the variance in the variables (chiefly price_usd).
# k means output
X = df[['price_usd', 'srch_booking_window', 'srch_saturday_night_bool']]
X = X.reset_index(drop = True)
km = KMeans(n_clusters = 7)
km.fit(X)
km.predict(X)
labels = km.labels_
X.head()
| price_usd | srch_booking_window | srch_saturday_night_bool | |
|---|---|---|---|
| 0 | 84.0 | 19 | 0 |
| 1 | 78.0 | 16 | 1 |
| 2 | 114.0 | 56 | 1 |
| 3 | 76.0 | 56 | 1 |
| 4 | 128.0 | 0 | 1 |
X["Cluster"] = labels
X
| price_usd | srch_booking_window | srch_saturday_night_bool | Cluster | |
|---|---|---|---|---|
| 0 | 84.0 | 19 | 0 | 3 |
| 1 | 78.0 | 16 | 1 | 3 |
| 2 | 114.0 | 56 | 1 | 6 |
| 3 | 76.0 | 56 | 1 | 6 |
| 4 | 128.0 | 0 | 1 | 0 |
| ... | ... | ... | ... | ... |
| 3043 | 152.0 | 11 | 1 | 5 |
| 3044 | 56.0 | 172 | 1 | 1 |
| 3045 | 68.0 | 8 | 0 | 3 |
| 3046 | 169.0 | 27 | 1 | 5 |
| 3047 | 122.0 | 32 | 1 | 0 |
3048 rows × 4 columns
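The distance-to-centroid threshold described in 2.1 is not computed explicitly in this notebook; a minimal sketch on synthetic 2-D data (the cluster centers, injected outliers, and 99th-percentile cutoff are illustrative choices):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(7)

# Three synthetic, well-separated clusters standing in for the feature space of X
centers = np.array([[0.0, 0.0], [10.0, 10.0], [0.0, 10.0]])
pts = np.vstack([rng.normal(c, 1.0, size=(100, 2)) for c in centers])
# Append three points that sit far from every cluster
pts = np.vstack([pts, [[5.0, 5.0], [20.0, 0.0], [-10.0, -10.0]]])

km = KMeans(n_clusters=3, n_init=10, random_state=7).fit(pts)

# Distance of each point to its assigned centroid
dist = np.linalg.norm(pts - km.cluster_centers_[km.labels_], axis=1)

# Flag points beyond the 99th percentile of distances as anomalies
cutoff = np.percentile(dist, 99)
anomalies = np.where(dist > cutoff)[0]
print("flagged indices:", anomalies)
```

The percentile cutoff is the simplest choice; a fixed distance or a per-cluster threshold would work the same way against `dist`.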
2.2 3D Clusters¶
Plotting the K-Means output
from src.analysis import plotters
# Color mapping; adjust if the actual number of clusters changes
color_mapping = {
    0: "red", 1: "green", 2: "blue", 3: "yellow",
    4: "purple", 5: "orange", 6: "black"
}
# Map each cluster label to a color
colors = [color_mapping[cluster] for cluster in X["Cluster"]]
plotters.plot_3d_scatter(
    x=X["price_usd"],
    y=X["srch_booking_window"],
    z=X["srch_saturday_night_bool"],
    hover_text=X["Cluster"].tolist(),
    color_list=colors,
    title="3D K-Means Clusters",
    camera_pos={"x": 0.5, "y": 1, "z": 1}
)
import plotly.express as px
fig = px.scatter_3d(X, x='price_usd', y='srch_booking_window', z='srch_saturday_night_bool',
color='Cluster')
fig.show()
An example of PCA-based anomaly detection

The logic of PCA anomaly detection rests on the reconstruction error: when data is compressed onto the principal components and then reconstructed back to the original dimensionality, some error results. This reconstruction error essentially measures how well a data point fits the PCA model. If a data point fits the model well (i.e. lies in the subspace defined by the principal components), its reconstruction error should be low.

Anomaly detection: data points with a high reconstruction error likely do not follow the main trends and structure of the data. They may therefore be anomalies, because they cannot be represented well by the main components of the PCA model.

Setting the threshold: the threshold is usually determined from the distribution of reconstruction errors. Choosing the 95th percentile means we allow 5% of data points to be treated as anomalies; their reconstruction error is higher than that of the other 95%. Why 95%? The threshold is not fixed and can be adjusted to the application and the desired sensitivity. 95% is a common choice because it sets a relatively strict standard while avoiding misclassifying too many normal points as anomalies.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
# Example data
data = np.random.randn(100, 5)  # 100 samples, 5 features
'''
[[-1.48187827e-01 -1.75668870e-01  6.83403529e-02 -9.59258255e-01
  -1.06998865e-01]
 [-1.20671500e+00 -2.71626282e+00 -5.73575246e-01  1.06723512e+00
  -1.07234300e+00]
'''
# Step 1: standardize the data
scaler = StandardScaler()
data_normalized = scaler.fit_transform(data)
# Steps 2-4: apply PCA
pca = PCA(n_components=2)  # use two principal components
pca.fit(data_normalized)
# Step 5: project the data onto the principal components
data_projected = pca.transform(data_normalized)
# Step 6: reconstruct the original data and compute the reconstruction error
data_reconstructed = pca.inverse_transform(data_projected)
reconstruction_error = np.sum((data_normalized - data_reconstructed) ** 2, axis=1)
# Step 7: set the anomaly threshold and flag outliers
threshold = np.percentile(reconstruction_error, 95)  # 95th-percentile threshold
outliers = reconstruction_error > threshold
# Plot the results
plt.figure(figsize=(12, 6))
# Plot the projected data
plt.subplot(1, 2, 1)
plt.scatter(data_projected[:, 0], data_projected[:, 1], c='blue', label='Normal')
plt.scatter(data_projected[outliers, 0], data_projected[outliers, 1], c='red', label='Outlier')
plt.title('Projected Data')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.legend()
# Plot the reconstruction errors
plt.subplot(1, 2, 2)
plt.bar(range(len(reconstruction_error)), reconstruction_error, color='blue')
plt.bar(np.where(outliers)[0], reconstruction_error[outliers], color='red')
plt.title('Reconstruction Error by Observation')
plt.xlabel('Observation')
plt.ylabel('Reconstruction Error')
plt.tight_layout()
plt.show()
print("Reconstruction error:", reconstruction_error)
print("Outliers detected:", outliers)
Reconstruction error: [ 3.7911034 6.49464587 1.87799868 1.26621247 0.59204335 3.89571909 0.85042552 0.61084528 1.87135888 1.49638359 2.53943672 0.91533827 0.13979625 0.09245408 4.86394317 5.96538444 0.24425115 1.0896504 1.58257526 1.41836734 1.08945828 0.86515221 3.79831011 4.12517792 0.73788622 1.48127554 0.6994028 1.70378853 3.8711495 5.3653281 6.21350682 0.89969119 2.39748873 1.2015869 3.27881821 1.72333517 1.29119666 3.7662234 0.53211156 2.07051893 1.33083755 3.00480613 0.56013221 2.10952402 1.9549264 2.06542904 2.2840912 3.95570685 0.43143032 2.04027108 1.3833085 2.15998642 0.2412585 1.0011954 3.3615319 3.46247204 0.52839811 2.24982511 3.20045285 1.67232051 11.50709909 2.30764427 5.63768298 1.69754245 6.16694628 5.56568183 1.12144372 2.2299576 0.2434963 3.84203372 2.29908934 8.0824526 4.68802223 4.12000316 6.116234 1.2102599 2.93235621 2.30895709 0.64902087 3.12205028 1.98454411 0.81056452 1.60021985 1.7276939 9.185004 3.26020649 3.72799263 1.09582522 3.45882254 1.57878253 1.75989294 7.96643566 0.8585968 3.10464773 0.3822829 2.01576139 4.58692594 1.04852569 3.16606458 4.28682984] Outliers detected: [False True False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False False True False False False False False False False False False False True False False False False False False False False False False False False True False False False False False False True False False False False False False False False]
import pylab as pl
2.3 PCA¶
The PCA-Based Anomaly Detection component solves the problem by analyzing available features to determine what constitutes a "normal" class. The component then applies distance metrics to identify cases that represent anomalies. This approach lets you train a model by using existing imbalanced data.
data = df[['price_usd', 'srch_booking_window', 'srch_saturday_night_bool']]
X = data.values
"""
array([[ 84., 19., 0.],
[ 78., 16., 1.],
[114., 56., 1.],
"""
# 1. Standardize the data
X_std = StandardScaler().fit_transform(X)
# 2. Compute the eigenvectors and eigenvalues of the covariance matrix
mean_vec = np.mean(X_std, axis = 0)
cov_mat = np.cov(X_std.T)
eig_vals, eig_vecs = np.linalg.eig(cov_mat)
# (eigenvalue, eigenvector) pairs, sorted by eigenvalue in descending order
eig_pairs = [(np.abs(eig_vals[i]), eig_vecs[:,i]) for i in range(len(eig_vals))]
eig_pairs.sort(key = lambda x: x[0], reverse = True)
# Explained variance from the eigenvalues
total = sum(eig_vals)
# Individual explained variance (%)
var_exp = [(i/total)*100 for i in sorted(eig_vals, reverse = True)]
# Cumulative explained variance
cum_var_exp = np.cumsum(var_exp)
print("Cumulative explained variance: ", cum_var_exp)
Cumulative explained variance: [ 47.10123543 80.66253687 100. ]
plt.figure(figsize = (16, 8))
plt.bar(range(len(var_exp)), var_exp,
alpha = 0.5, align = 'center',
label = 'individual explained var',
color = 'r'
)
plt.step(range(len(cum_var_exp)), cum_var_exp,
where = 'mid',
label = 'cumulative explained var')
plt.xlabel('principal components')
plt.ylabel('explained var ratio')
plt.legend(loc = 'best')
plt.show();
👉🏻 Explained variance by component:
- Component 1 explains ≈ 47% of the variance
- Component 2 explains ≈ 34%
- Component 3 explains ≈ 19%

Components 1 + 2 together explain ≈ 80% of the variance, so we set n_components = 2 and standardize the features.
Single vs double square brackets in pandas¶
A single bracket returns a Series; double brackets return a DataFrame.
import pandas as pd
test = False
if test:
    # Build an example DataFrame
    data = {
        'price_usd': [200, 150, 120, 180, 220],
        'srch_booking_window': [14, 3, 20, 1, 5],
        'srch_saturday_night_bool': [True, False, True, False, True],
        'other_info': [1, 2, 3, 4, 5]  # an extra column that we will not include in the selection
    }
    df = pd.DataFrame(data)
    # Select a single column with single brackets
    selected_column = df['price_usd']
    print("Single brackets return a Series: \n", selected_column)
    # Select multiple columns with double brackets
    selected_columns = df[['price_usd', 'srch_booking_window', 'srch_saturday_night_bool']]
    print("\nDouble brackets return a DataFrame:\n", selected_columns)
data = df[['price_usd', 'srch_booking_window', 'srch_saturday_night_bool']]
data
| price_usd | srch_booking_window | srch_saturday_night_bool | |
|---|---|---|---|
| 3945840 | 84.0 | 19 | 0 |
| 63387 | 78.0 | 16 | 1 |
| 3352426 | 114.0 | 56 | 1 |
| 5257418 | 76.0 | 56 | 1 |
| 7091061 | 128.0 | 0 | 1 |
| ... | ... | ... | ... |
| 2792991 | 152.0 | 11 | 1 |
| 9363497 | 56.0 | 172 | 1 |
| 143486 | 68.0 | 8 | 0 |
| 2478763 | 169.0 | 27 | 1 |
| 3800163 | 122.0 | 32 | 1 |
3048 rows × 3 columns
data = df[['price_usd', 'srch_booking_window', 'srch_saturday_night_bool']]
X = data.values
# 1. Standardize features
X_std = StandardScaler().fit_transform(X)
data = pd.DataFrame(X_std)
data_normalized = data
# 2. Reduce to 2 components and project the data onto them
pca = PCA(n_components = 2)
data = pca.fit_transform(data)
data_projected = data
# 3. Standardize the 2 new features
'''
In this third step, the PCA-transformed data is standardized again.
The purpose of this step:
Standardize the principal components: although the PCA inputs (the original
features) were standardized before PCA, the PCA outputs (the principal
component scores) are not necessarily standardized. Each principal component
is a linear combination of the data, and the scale (mean and variance) of
these combinations can differ from that of the original standardized features.
Uniform scaling: standardizing again ensures that all principal components
share the same scale (mean 0, standard deviation 1). This matters for
downstream processing, especially for machine learning algorithms that are
sensitive to the scale of their inputs.
Better model performance: for certain analyses and algorithms (e.g. clustering
or distance-based anomaly detection), giving all features equal weight
(making them dimensionless) is key to good performance.
'''
scaler = StandardScaler()
np_scaled = scaler.fit_transform(data)
data = pd.DataFrame(np_scaled)
Detecting anomalies via reconstruction error (Method 1)¶
# Step 6: reconstruct the original data and compute the reconstruction error
data_reconstructed = pca.inverse_transform(data_projected)
reconstruction_error = np.sum((data_normalized - data_reconstructed) ** 2, axis=1)
# Step 7: set the anomaly threshold and flag anomalies
threshold = np.percentile(reconstruction_error, 99)  # threshold at the 99th percentile
outliers = reconstruction_error > threshold
'''
Outliers detected:
0 False
1 False
2 False
3 False
4 False
...
3043 False
3044 False
3045 False
3046 False
3047 False
Length: 3048, dtype: bool
'''
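The whole Method-1 pipeline, reconstruction error plus a percentile threshold, can be condensed into a self-contained sketch; the synthetic data and the planted outliers here are purely illustrative:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 300
t, s = rng.normal(size=n), rng.normal(size=n)
# Data lying close to the plane z = x + y, plus 5 planted off-plane outliers
X = np.column_stack([t, s, t + s + 0.05 * rng.normal(size=n)])
X[:5, 2] += 6

X_std = StandardScaler().fit_transform(X)
pca = PCA(n_components=2).fit(X_std)
X_rec = pca.inverse_transform(pca.transform(X_std))

# Reconstruction error: what the dropped component could not explain
err = np.sum((X_std - X_rec) ** 2, axis=1)
outliers = err > np.percentile(err, 98)
print(outliers[:5])                      # the planted points are flagged
```

PCA keeps the plane the bulk of the data lives on, so points lying off that plane reconstruct poorly and dominate the error.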
# Visualization
plt.figure(figsize=(12, 6))
# Plot the projected data
plt.subplot(1, 2, 1)
plt.scatter(data_projected[:, 0], data_projected[:, 1], c='blue', label='Normal')
plt.scatter(data_projected[outliers, 0], data_projected[outliers, 1], c='red', label='Outlier')
plt.title('Projected Data')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.legend()
# Plot the reconstruction error
plt.subplot(1, 2, 2)
plt.bar(range(len(reconstruction_error)), reconstruction_error, color='blue')
plt.bar(np.where(outliers)[0], reconstruction_error[outliers], color='red')
plt.title('Reconstruction Error by Observation')
plt.xlabel('Observation')
plt.ylabel('Reconstruction Error')
plt.tight_layout()
plt.show()
df_anomaly = df.copy()
df_anomaly['anomaly'] = outliers
anomaly_df = df_anomaly.loc[df_anomaly['anomaly'] == True, ['date_time_int', 'price_usd']]
print("anomaly_df", anomaly_df.shape)
anomaly_df (31, 2)
fig, ax = plt.subplots(figsize = (10, 5))
ax.plot(df['date_time_int'], df['price_usd'],
color = 'lightgreen', label = 'Normal')
a = df.loc[df_anomaly['anomaly'] == 1,
['date_time_int', 'price_usd']]
ax.scatter(a['date_time_int'], a['price_usd'],
color = 'red', label = 'Anomaly')
plt.xlabel('time')
plt.ylabel('USD')
plt.legend()
plt.show()
df_anomaly['anomaly'].value_counts()
anomaly False 3017 True 31 Name: count, dtype: int64
Detecting anomalies via clustering in the projected space (Method 2)¶
# Cluster the standardized data in the projected space
n_cluster = range(1, 21)  # candidate cluster counts (assumed range)
kmeans = [KMeans(n_clusters = i).fit(data) for i in n_cluster]
scores = [kmeans[i].score(data) for i in range(len(kmeans))]
# elbow curve
fig, ax = plt.subplots(figsize = (8, 5))
ax.plot(n_cluster, scores, color = 'orange')
plt.xlabel('clusters num')
plt.ylabel('score')
plt.title('elbow curve')
plt.show()
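On synthetic data the elbow is easy to see; a minimal sketch with three well-separated blobs, so the score flattens after k = 3 (the data and parameters here are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
# Three well-separated 2-D blobs stand in for the projected data
data = np.vstack([rng.normal(loc=c, scale=0.3, size=(100, 2)) for c in (0, 4, 8)])

n_cluster = range(1, 8)
kmeans = [KMeans(n_clusters=i, n_init=10, random_state=0).fit(data) for i in n_cluster]
scores = [m.score(data) for m in kmeans]   # negative inertia: higher is better

# The big gains happen up to k = 3; afterwards the curve flattens (the elbow)
gain_2_to_3 = scores[2] - scores[1]
gain_3_to_4 = scores[3] - scores[2]
print(gain_2_to_3 > 10 * gain_3_to_4)
```

`score` returns the negative within-cluster sum of squares, so the elbow shows up as the point where adding a cluster stops improving the score much.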
df
| date_time | price_usd | srch_booking_window | srch_saturday_night_bool | date_time_int | |
|---|---|---|---|---|---|
| 3945840 | 2012-11-01 02:48:30 | 84.0 | 19 | 0 | 1351738110000000000 |
| 63387 | 2012-11-01 03:06:43 | 78.0 | 16 | 1 | 1351739203000000000 |
| 3352426 | 2012-11-01 09:04:18 | 114.0 | 56 | 1 | 1351760658000000000 |
| 5257418 | 2012-11-01 09:11:03 | 76.0 | 56 | 1 | 1351761063000000000 |
| 7091061 | 2012-11-01 10:15:25 | 128.0 | 0 | 1 | 1351764925000000000 |
| ... | ... | ... | ... | ... | ... |
| 2792991 | 2013-06-30 07:03:56 | 152.0 | 11 | 1 | 1372575836000000000 |
| 9363497 | 2013-06-30 15:42:32 | 56.0 | 172 | 1 | 1372606952000000000 |
| 143486 | 2013-06-30 16:43:40 | 68.0 | 8 | 0 | 1372610620000000000 |
| 2478763 | 2013-06-30 20:07:12 | 169.0 | 27 | 1 | 1372622832000000000 |
| 3800163 | 2013-06-30 22:50:21 | 122.0 | 32 | 1 | 1372632621000000000 |
3048 rows × 5 columns
data
| 0 | 1 | |
|---|---|---|
| 0 | -0.889864 | -0.521900 |
| 1 | 0.230566 | -0.272218 |
| 2 | 0.567439 | 0.546642 |
| 3 | 0.155659 | 0.581776 |
| 4 | 0.793675 | -0.659305 |
| ... | ... | ... |
| 3043 | 1.039107 | -0.447155 |
| 3044 | -0.215446 | 3.071488 |
| 3045 | -1.048605 | -0.741446 |
| 3046 | 1.202030 | -0.122015 |
| 3047 | 0.686070 | 0.027958 |
3048 rows × 2 columns
# The elbow curve above suggested 7, so take kmeans[7] directly
# (note: with n_cluster starting at 1, this model has 8 clusters)
df['cluster'] = kmeans[7].predict(data)
# data is now the standardized projection onto the new space
# df is the original data
# Map the clustering result back onto the original data: align the indices,
# then add the new cluster and component columns.
df.index = data.index
df['pc1'] = data[0]
df['pc2'] = data[1]
df['cluster'].value_counts()
cluster 0 786 6 712 2 444 1 394 4 352 5 150 7 124 3 86 Name: count, dtype: int64
def getDistanceByPoint(data, model):
    # Distance of each point to the centre of its assigned cluster
    distance = pd.Series(dtype='float64')
    for i in range(0, len(data)):
        Xa = np.array(data.loc[i])
        Xb = model.cluster_centers_[model.labels_[i]]
        distance.at[i] = np.linalg.norm(Xa - Xb)
    return distance
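The per-point loop in getDistanceByPoint can be checked against a vectorized equivalent; a tiny sketch on four made-up points:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

data = pd.DataFrame([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(data)

# Vectorized: centre of each point's assigned cluster, then row-wise norms
centers = model.cluster_centers_[model.labels_]
distance = pd.Series(np.linalg.norm(data.values - centers, axis=1))
print(distance.round(2).tolist())        # every point sits 0.05 from its centre
```

Indexing `cluster_centers_` with `labels_` broadcasts each point's own centroid, so one `np.linalg.norm` call replaces the Python loop.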
# Fraction of points to treat as outliers
outliers_fraction = 0.01
distance = getDistanceByPoint(data, kmeans[7])
'''
0 0.196265
1 0.244399
2 0.172127
3 0.552738
4 0.447390
...
3043 0.439126
3044 0.646786
3045 0.074989
3046 0.428241
3047 0.392277
Length: 3048, dtype: float64
'''
# Number of data points to flag as anomalies, given the outlier fraction
outlier_num = int(outliers_fraction * len(distance))
# Distance threshold for flagging a point as an anomaly: the smallest of the
# outlier_num largest distances, i.e. the lower edge of the top 1%.
threshold = distance.nlargest(outlier_num).min()
df['anomaly'] = (distance >= threshold).astype(int)
df.anomaly.value_counts()
anomaly 0 3018 1 30 Name: count, dtype: int64
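The thresholding step above (take the smallest of the top outliers_fraction distances) behaves like this on a toy Series; the values are illustrative:

```python
import pandas as pd

distance = pd.Series([0.1, 0.2, 0.9, 0.3, 5.0, 0.4, 2.5, 0.15, 0.25, 0.35])
outliers_fraction = 0.2

outlier_num = int(outliers_fraction * len(distance))  # 2 points to flag
threshold = distance.nlargest(outlier_num).min()      # smallest of the top 2
flags = (distance >= threshold).astype(int)
print(threshold, int(flags.sum()))                    # 2.5 2
```

Exactly `outlier_num` points end up flagged, because the threshold is chosen as the smallest distance among the top-k.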
fig, ax = plt.subplots(figsize = (12, 6))
colors = {0:'lightgreen', 1:'red'}
ax.scatter(df['pc1'], df['pc2'],
c = df['anomaly'].apply(lambda x: colors[x]))
plt.xlabel('pc1')
plt.ylabel('pc2')
plt.show();
df = df.sort_values('date_time')
df['date_time'] = df.date_time.astype(np.int64)
# object with anomalies
a = df.loc[df['anomaly'] == 1,
['date_time_int', 'price_usd']]
a
| date_time_int | price_usd | |
|---|---|---|
| 8 | 1351779652000000000 | 211.0 |
| 96 | 1352380267000000000 | 230.0 |
| 102 | 1352416996000000000 | 225.0 |
| 151 | 1352847902000000000 | 326.0 |
| 416 | 1354869170000000000 | 126.0 |
| 745 | 1356942326000000000 | 425.0 |
| 946 | 1358170816000000000 | 55.0 |
| 997 | 1358406730000000000 | 53.0 |
| 1005 | 1358444151000000000 | 102.0 |
| 1140 | 1359377578000000000 | 87.0 |
| 1157 | 1359523443000000000 | 112.0 |
| 1278 | 1360489270000000000 | 58.0 |
| 1310 | 1360663491000000000 | 349.0 |
| 1756 | 1363505928000000000 | 370.0 |
| 1759 | 1363523387000000000 | 320.0 |
| 1948 | 1364679638000000000 | 66.0 |
| 1973 | 1364920263000000000 | 404.0 |
| 2242 | 1366838075000000000 | 142.0 |
| 2305 | 1367409800000000000 | 46.0 |
| 2457 | 1368561390000000000 | 536.0 |
| 2560 | 1369236899000000000 | 62.0 |
| 2770 | 1370621931000000000 | 319.0 |
| 2785 | 1370710670000000000 | 366.0 |
| 2790 | 1370774846000000000 | 319.0 |
| 2796 | 1370857900000000000 | 281.0 |
| 2863 | 1371302820000000000 | 324.0 |
| 2867 | 1371367527000000000 | 434.0 |
| 2903 | 1371545011000000000 | 282.0 |
| 2974 | 1371892824000000000 | 319.0 |
| 3036 | 1372425746000000000 | 98.0 |
fig, ax = plt.subplots(figsize = (10, 5))
ax.plot(df['date_time_int'], df['price_usd'],
color = 'lightgreen', label = 'Normal')
ax.scatter(a['date_time_int'], a['price_usd'],
color = 'red', label = 'Anomaly')
plt.xlabel('time')
plt.ylabel('USD')
plt.legend()
plt.show();
df['anomaly'].unique()
array([0, 1])
a = df.loc[df['anomaly'] == 0, 'price_usd']
b = df.loc[df['anomaly'] == 1, 'price_usd']
fig, axs = plt.subplots(figsize = (10, 5))
axs.hist([a, b],
bins = 50, stacked = True,
color = ['lightgreen', 'red'])
plt.show();
df.anomaly.value_counts()
anomaly 0 3018 1 30 Name: count, dtype: int64
2.4 Isolation Forest¶
Detect anomalies based on data points that are few and different
No use of density / distance measure i.e. different from clustering based / distanced based algorithms
Randomly select a feature
Randomly select a split between max and min values of selected feature
Length of path, averaged over a forest of random trees = measure of normality
Random partitioning = shorter path for anomalies
If forest produces shorter paths for samples, then they are likely to be anomalies
In the isolation-forest illustration, an anomalous point (left) can be isolated from the rest of the data with just one split, while the normal point (right) requires four splits to isolate completely.
Isolation Forest Algorithm takes advantage of the following properties of anomalous samples -
Fewness — anomalous samples are a minority and there will only be a few of them in any dataset.
Different — anomalous samples have values/attributes that are very different from those of normal samples.
We can see that outliers are easier to isolate, while inliers are harder to isolate.
data = df[['price_usd', 'srch_booking_window', 'srch_saturday_night_bool']]
# 1. Standardize features
scaler = StandardScaler()
np_scaled = scaler.fit_transform(data)
data = pd.DataFrame(np_scaled)
# 2. Isolation forest
outliers_fraction = 0.01
# ifo is the isolation forest model
ifo = IsolationForest(contamination = outliers_fraction)
ifo.fit(data)
# Predict the anomaly score
# add the data to the main df
# Anomaly point is marked as -1, and normal point is marked as 1
df['anomaly1'] = pd.Series(ifo.predict(data))
# visualization
fig, ax = plt.subplots(figsize = (10, 5))
a = df.loc[df['anomaly1'] == -1, ['date_time_int', 'price_usd']]
ax.plot(df['date_time_int'], df['price_usd'],
color = 'lightgreen', label = 'Normal')
ax.scatter(a['date_time_int'], a['price_usd'],
color = 'red', label = 'Anomaly')
plt.legend()
plt.show();
df['anomaly1'].unique()
array([ 1, -1])
a = df.loc[df['anomaly1'] == 1, 'price_usd']
b = df.loc[df['anomaly1'] == -1, 'price_usd']
fig, ax = plt.subplots(figsize = (10, 5))
ax.hist([a, b],
bins = 50, stacked = True,
color = ['lightgreen', 'red'] )
plt.show();
df['anomaly1'].value_counts()
anomaly1 1 3017 -1 31 Name: count, dtype: int64
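A self-contained check that an isolation forest assigns -1 to an obviously isolated point; the data is synthetic and the parameters mirror the cells above:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(200, 2)), [[8.0, 8.0]]])  # one planted anomaly

ifo = IsolationForest(contamination=0.01, random_state=0).fit(X)
labels = ifo.predict(X)                   # -1 = anomaly, 1 = normal
print(int(labels[-1]))                    # -> -1
```

The planted point sits far from the bulk of the data, so random splits isolate it in very few steps and it receives the -1 label.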
Support vector machine models are typically associated with supervised learning; for anomaly detection we instead consider:
- One class SVM
- Gaussian distribution
- Markov chain
2.5 One class SVM¶
- Unsupervised Anomaly Detection Method
- Estimate support of high dimensional distribution
data = df[['price_usd', 'srch_booking_window', 'srch_saturday_night_bool']]
scaler = StandardScaler()
np_scaled = scaler.fit_transform(data)
data = pd.DataFrame(np_scaled)
# Train
# outliers_fraction is the fraction of outliers in the data,
# defined as 1% in this case
osvm = OneClassSVM(nu = outliers_fraction,
kernel = 'rbf',
gamma = 0.01)
osvm.fit(data)
df['anomaly2'] = pd.Series(osvm.predict(data))
fig, ax = plt.subplots(figsize = (10, 5))
a = df.loc[df['anomaly2'] == -1, ['date_time_int', 'price_usd']]
ax.plot(df['date_time_int'], df['price_usd'], color = 'lightgreen', label = 'Normal')
ax.scatter(a['date_time_int'], a['price_usd'], color = 'red', label = 'Anomaly')
plt.legend()
plt.show();
a = df.loc[df['anomaly2'] == 1, 'price_usd']
b = df.loc[df['anomaly2'] == -1, 'price_usd']
fig, ax = plt.subplots(figsize = (10, 5))
ax.hist([a, b], bins = 50,
stacked = True, color = ['lightgreen','red'])
plt.show();
df['anomaly2'].value_counts()
anomaly2 1 3017 -1 31 Name: count, dtype: int64
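One-Class SVM can flag both unusually high and unusually low values; a toy 1-D sketch (the prices, nu, and gamma here are illustrative choices, not the notebook's exact data):

```python
import numpy as np
from sklearn.svm import OneClassSVM
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
prices = np.concatenate([rng.normal(100, 5, 95), [300.0, 310.0, 320.0, 5.0, 2.0]])
X = StandardScaler().fit_transform(prices.reshape(-1, 1))

# nu upper-bounds the fraction of training points treated as outliers
osvm = OneClassSVM(nu=0.05, kernel='rbf', gamma='scale').fit(X)
pred = osvm.predict(X)                    # -1 = anomaly

# Both the high spikes and the low dips get flagged
print(int(pred[95]), int(pred[98]))
```

Because the model estimates the support of the bulk of the distribution rather than a distance from a single centre, points outside the support on either tail are rejected.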
2.6 Gaussian Distribution¶
Assume data is normally distributed
Use covariance.EllipticEnvelope from scikit-learn to find key params of general distribution by assuming entire dataset = an expression of an underlying multivariate Gaussian distribution
Create two DataFrames based on the categories defined by the srch_saturday_night_bool flag
df_class0 = df.loc[df['srch_saturday_night_bool'] == 0, 'price_usd']
df_class1 = df.loc[df['srch_saturday_night_bool'] == 1, 'price_usd']
fig, axs = plt.subplots(1, 2)
df_class0.hist(ax = axs[0], bins = 50, color = 'lightgreen')
df_class1.hist(ax = axs[1], bins = 50, color = 'red');
Apply EllipticEnvelope to each category
Set contamination param (proportion of outliers present in dataset)
Use decision_function to compute the decision function of the given observations (equivalent to shifted Mahalanobis distances).
The threshold for identifying outliers is 0 (compatible with other detection algorithms).
predict(x_train) predicts the labels of x_train according to the fitted model:
1 = normal, -1 = anomaly
envelope = EllipticEnvelope(contamination = outliers_fraction)
x_train = df_class0.values.reshape(-1, 1)
envelope.fit(x_train)
df_class0 = pd.DataFrame(df_class0)
df_class0['deviation'] = envelope.decision_function(x_train)
df_class0['anomaly'] = envelope.predict(x_train)
envelope = EllipticEnvelope(contamination = outliers_fraction)
x_train = df_class1.values.reshape(-1, 1)
envelope.fit(x_train)
df_class1 = pd.DataFrame(df_class1)
df_class1['deviation'] = envelope.decision_function(x_train)
df_class1['anomaly'] = envelope.predict(x_train)
df_class = pd.concat([df_class0, df_class1])
df['anomaly3'] = df_class['anomaly']
fig, ax = plt.subplots(figsize = (10, 5))
a = df.loc[df['anomaly3'] == -1,
('date_time_int', 'price_usd')]
ax.plot(df['date_time_int'], df['price_usd'],
color = 'lightgreen')
ax.scatter(a['date_time_int'], a['price_usd'],
color = 'red')
plt.show();
df['anomaly3'].value_counts()
anomaly3 1 3017 -1 31 Name: count, dtype: int64
a = df.loc[df['anomaly3'] == 1, 'price_usd']
b = df.loc[df['anomaly3'] == -1, 'price_usd']
fig, ax = plt.subplots(figsize = (10, 5))
ax.hist([a, b],
bins = 50, stacked = True,
color = ['lightgreen', 'red'])
plt.show();
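The same fit / decision_function / predict sequence in miniature, on synthetic 1-D data with one planted extreme value; the decision function is negative exactly for the flagged points:

```python
import numpy as np
from sklearn.covariance import EllipticEnvelope

rng = np.random.default_rng(0)
x_train = np.concatenate([rng.normal(100, 10, 99), [500.0]]).reshape(-1, 1)

envelope = EllipticEnvelope(contamination=0.01, random_state=0).fit(x_train)
labels = envelope.predict(x_train)            # -1 = anomaly, 1 = normal
scores = envelope.decision_function(x_train)  # shifted distance, < 0 for anomalies

print(int(labels[-1]), bool(scores[-1] < 0))
```

The robust covariance estimate keeps the envelope tight around the Gaussian bulk, so the planted point at 500 lands far outside it.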
2.7 Analysis 📝¶
👉🏻 Based on this study, we observe that K-Means, 3D clustering, PCA, Isolation Forest, and the Gaussian-distribution approach all detected the high prices.
👉🏻 But only One-Class SVM detected both high prices and low prices.
❓ BUT.. how did One-Class SVM make that difference?
One-Class SVM performs novelty detection with a max-margin method, i.e. it does not model a probability distribution. The idea of novelty detection is to detect rare events, i.e. events that happen rarely and hence of which you have very few samples.

🕵🏻 Anomaly Detection: Sensors | VAR
3️⃣ Problem Statement 3: Sensors🚦¶
Detect the outliers/anomalies in the experimental dataset. The dataset stores hourly counting series detected by sensors. These sensors count both people riding bikes and pedestrians. Separate volumes are tallied for each travel mode. Wires in a diamond formation in the concrete detect bikes and an infrared sensor mounted on a wooden post detects pedestrians.
import scipy.stats as stats
import matplotlib.pyplot as plt
from statsmodels.tsa.statespace.sarimax import SARIMAX
from statsmodels.tsa.vector_ar.var_model import VAR
import warnings
warnings.simplefilter('ignore')
Data Set
Using a modified version of the Seattle Burke Gilman Trail dataset (hosted by the city of Seattle as part of its open data project).
### READ DATA ###
train_hours = 80*7*24 # 80 weeks x 7 days x 24 hours
test_hours = 15*7*24 # 15 weeks x 7 days x 24 hours
df = pd.read_csv('data/mts-trail-west-of-i-90-bridge.csv', nrows=train_hours+test_hours, parse_dates=['Date'])
print(df.shape)
df.head()
(15960, 6)
| Date | MTS Trl West of I-90 Bridge Total | Ped East | Ped West | Bike East | Bike West | |
|---|---|---|---|---|---|---|
| 0 | 2013-12-18 00:00:00 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 1 | 2013-12-18 01:00:00 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 2 | 2013-12-18 02:00:00 | 3.0 | 1.0 | 1.0 | 1.0 | 0.0 |
| 3 | 2013-12-18 03:00:00 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 4 | 2013-12-18 04:00:00 | 2.0 | 0.0 | 0.0 | 2.0 | 0.0 |
### FILL NAN ###
df['Ped East'] = df['Ped East'].groupby(df.Date.dt.hour).transform(lambda x: x.fillna(x.median()))
df['Ped West'] = df['Ped West'].groupby(df.Date.dt.hour).transform(lambda x: x.fillna(x.median()))
df['Bike East'] = df['Bike East'].groupby(df.Date.dt.hour).transform(lambda x: x.fillna(x.median()))
df['Bike West'] = df['Bike West'].groupby(df.Date.dt.hour).transform(lambda x: x.fillna(x.median()))
df['BGT West of NE 70th Total'] = df['Ped East'] + df['Ped West'] + df['Bike East'] + df['Bike West']
df['Date'] = pd.to_datetime(df['Date'].dt.date)
df['Date']
0 2013-12-18
1 2013-12-18
2 2013-12-18
3 2013-12-18
4 2013-12-18
...
15955 2015-10-13
15956 2015-10-13
15957 2015-10-13
15958 2015-10-13
15959 2015-10-13
Name: Date, Length: 15960, dtype: datetime64[ns]
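The hour-wise median imputation above can be illustrated on a tiny frame; the column names mirror the real data but the values are made up:

```python
import numpy as np
import pandas as pd

df_toy = pd.DataFrame({
    'Date': pd.to_datetime(['2020-01-01 00:00', '2020-01-02 00:00',
                            '2020-01-03 00:00', '2020-01-01 12:00',
                            '2020-01-02 12:00']),
    'Ped East': [2.0, 4.0, np.nan, 50.0, np.nan],
})

# Each NaN is replaced by the median of the same hour of day
filled = df_toy['Ped East'].groupby(df_toy.Date.dt.hour).transform(
    lambda x: x.fillna(x.median()))
print(filled.tolist())                    # -> [2.0, 4.0, 3.0, 50.0, 50.0]
```

Grouping by `Date.dt.hour` keeps the strong daily profile of the counts: a missing midnight value is filled with typical midnight traffic, not with the overall median.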
### DAILY AGGREGATION ###
df_day = pd.DataFrame()
df_day['Ped East'] = df.groupby(df.Date)['Ped East'].sum()
df_day['Ped West'] = df.groupby(df.Date)['Ped West'].sum()
df_day['Bike East'] = df.groupby(df.Date)['Bike East'].sum()
df_day['Bike West'] = df.groupby(df.Date)['Bike West'].sum()
df_day['Total'] = df.groupby(df.Date)['BGT West of NE 70th Total'].sum()
df_day.index = pd.DatetimeIndex(df_day.index.values, freq=df_day.index.inferred_freq)
df_day['Total']
2013-12-18 436.0
2013-12-19 298.0
2013-12-20 63.0
2013-12-21 481.0
2013-12-22 265.0
...
2015-10-09 572.0
2015-10-10 297.0
2015-10-11 1249.0
2015-10-12 689.0
2015-10-13 581.0
Freq: D, Name: Total, Length: 665, dtype: float64
def show_traces_column(df: pd.DataFrame):
    fig = go.Figure()
    for col_name in df.columns.to_list()[1:]:
        fig.add_trace(go.Scatter(x=df.Date, y=df[col_name], mode='lines', name=col_name))
    fig.show()
show_traces_column(df.iloc[:,:])
In total, 5 counting series are supplied: 2 for pedestrian counts, 2 for bike counts, and the total, which is the sum of the previous series. There are two counters each for pedestrians and bikes because both directions of travel are registered.
### TRAIN SPLIT ###
train = df_day[:(train_hours//24)].copy()
train.shape
(560, 5)
### TOTAL COUNT TRAIN AUTOCORR ###
plt.figure(figsize=(16,6), dpi= 1000)
pd.plotting.autocorrelation_plot(train['Total'])
<Axes: xlabel='Lag', ylabel='Autocorrelation'>
Autocorrelation of a periodic signal¶
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf
# Build a periodic signal
t = np.linspace(0, 10, 500)  # time vector
period = 2                   # period of 2 seconds
amplitude = 1                # amplitude of 1
signal = amplitude * np.sin(2 * np.pi * t / period)  # sine-wave signal
# Plot the signal
plt.figure(figsize=(10, 4))
plt.plot(t, signal, label='Periodic Signal')
plt.title('Periodic Signal')
plt.xlabel('Time')
plt.ylabel('Amplitude')
plt.legend()
plt.grid(True)
plt.show()
# Compute and plot the autocorrelation
plt.figure(figsize=(10, 4))
plot_acf(signal, lags=200, alpha=0.05)  # autocorrelation plot with 200 lags
plt.title('Autocorrelation of the Periodic Signal')
plt.xlabel('Lags')
plt.ylabel('Autocorrelation')
plt.grid(True)
plt.show()
<Figure size 1000x400 with 0 Axes>
'''
Train 560 rows × 5 columns------------------------------------------
Ped East Ped West Bike East Bike West Total
2013-12-18 90.0 110.0 127.0 109.0 436.0
2013-12-19 26.0 31.0 121.0 120.0 298.0
2013-12-20 13.0 10.0 22.0 18.0 63.0
2013-12-21 50.0 42.0 206.0 183.0 481.0
2013-12-22 23.0 24.0 122.0 96.0 265.0
... ... ... ... ... ...
2015-06-26 27.0 44.0 585.0 292.0 948.0
2015-06-27 21.0 93.0 1011.0 509.0 1634.0
2015-06-28 29.0 159.0 677.0 538.0 1403.0
2015-06-29 8.0 64.0 533.0 279.0 884.0
2015-06-30 35.0 95.0 758.0 445.0 1333.0
train.index----------------------------------------------------------
DatetimeIndex(['2013-12-18', '2013-12-19', '2013-12-20', '2013-12-21',
'2013-12-22', '2013-12-23', '2013-12-24', '2013-12-25',
'2013-12-26', '2013-12-27',
...
'2015-06-21', '2015-06-22', '2015-06-23', '2015-06-24',
'2015-06-25', '2015-06-26', '2015-06-27', '2015-06-28',
'2015-06-29', '2015-06-30'],
dtype='datetime64[ns]', length=560, freq='D')
train.index.month-----------------------------------------------------
Index([12, 12, 12, 12, 12, 12, 12, 12, 12, 12,
...
6, 6, 6, 6, 6, 6, 6, 6, 6, 6],
dtype='int32', length=560)
'''
#---------------------------------------------------------------------
### MONTHLY TRAIN MEAN ###
month_mean_train = train.groupby(train.index.month).mean()
month_mean_train
| Ped East | Ped West | Bike East | Bike West | Total | |
|---|---|---|---|---|---|
| 1 | 63.322581 | 63.225806 | 224.258065 | 193.870968 | 544.677419 |
| 2 | 55.892857 | 56.696429 | 173.232143 | 143.785714 | 429.607143 |
| 3 | 84.500000 | 83.661290 | 280.225806 | 235.806452 | 684.193548 |
| 4 | 134.616667 | 125.366667 | 434.916667 | 341.466667 | 1036.366667 |
| 5 | 190.000000 | 156.903226 | 605.080645 | 455.854839 | 1407.838710 |
| 6 | 103.983333 | 120.366667 | 628.183333 | 446.533333 | 1299.066667 |
| 7 | 143.096774 | 131.806452 | 696.193548 | 447.645161 | 1418.741935 |
| 8 | 244.451613 | 278.870968 | 569.032258 | 404.838710 | 1497.193548 |
| 9 | 119.133333 | 138.533333 | 438.433333 | 347.100000 | 1043.200000 |
| 10 | 68.096774 | 74.322581 | 286.064516 | 249.774194 | 678.258065 |
| 11 | 56.866667 | 54.800000 | 210.200000 | 176.933333 | 498.800000 |
| 12 | 49.444444 | 40.622222 | 167.022222 | 136.333333 | 393.422222 |
### REMOVE LONG TERM SEASONALITY FROM TRAIN ###
train['Ped East'] = train.apply(lambda x: x['Ped East'] - month_mean_train['Ped East'][x.name.month], axis=1)
train['Ped West'] = train.apply(lambda x: x['Ped West'] - month_mean_train['Ped West'][x.name.month], axis=1)
train['Bike East'] = train.apply(lambda x: x['Bike East'] - month_mean_train['Bike East'][x.name.month], axis=1)
train['Bike West'] = train.apply(lambda x: x['Bike West'] - month_mean_train['Bike West'][x.name.month], axis=1)
train['Total'] = train.apply(lambda x: x['Total'] - month_mean_train['Total'][x.name.month], axis=1)
Code walkthrough¶
train['Ped East']: specifies the DataFrame column to operate on, i.e. we are updating the "Ped East" column of the train DataFrame.
train.apply(..., axis=1): apply runs a function over the DataFrame; axis=1 makes it run row-wise, so each call receives one row as a Series.
lambda x: ...: an anonymous function where x is the row passed in (as a Series). It computes one value per row, which is used to update that row's "Ped East" entry.
x['Ped East'] - month_mean_train['Ped East'][x.name.month]: the core computation:
x['Ped East']: the current row's "Ped East" value.
x.name: normally the row's index label. Here the index of train is a DatetimeIndex, so x.name.month returns the month of that date.
month_mean_train['Ped East'][x.name.month]: a lookup into month_mean_train, which stores the monthly means of "Ped East"; month_mean_train['Ped East'] accesses those means, and [x.name.month] selects the one for the current row's month.
Finally, subtracting the monthly mean from the original "Ped East" value yields an adjusted value with the seasonal effect removed.
Summary: this line subtracts from each "Ped East" value the mean of its corresponding month, removing the (long-term) seasonality from the data. This is useful in time-series analysis for controlling seasonal variables, especially when you want the model to focus on patterns other than seasonal effects.
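The row-wise apply explained above can also be written with the index.month.map trick the notebook later uses for the test split; a minimal equivalence check on synthetic data (the date range and values here are made up):

```python
import numpy as np
import pandas as pd

idx = pd.date_range('2021-01-01', periods=120, freq='D')
train_toy = pd.DataFrame({'Total': np.arange(120, dtype=float)}, index=idx)
month_mean = train_toy.groupby(train_toy.index.month).mean()

# Vectorized: map each row's month to its monthly mean, then subtract
vectorized = train_toy['Total'] - train_toy.index.month.map(month_mean['Total'])

# Row-wise apply, as in the cell above
row_wise = train_toy.apply(
    lambda x: x['Total'] - month_mean['Total'][x.name.month], axis=1)
print(bool(np.allclose(vectorized.values, row_wise.values)))
```

The vectorized form avoids one Python-level function call per row, which matters once the frame has thousands of rows.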
### TOTAL COUNT TRAIN AUTOCORR ###
plt.figure(figsize=(16,6), dpi= 1000)
pd.plotting.autocorrelation_plot(train['Total'])
<Axes: xlabel='Lag', ylabel='Autocorrelation'>
Moving to a multivariate approach that considers all the series and their interactions in the system.
### TRAIN TEST SPLIT IN MULTIVARIATE CASE ###
train_uni = train['Total'].copy()
test_uni = df_day['Total'][(train_hours//24):].copy()
test_uni = test_uni - test_uni.index.month.map(month_mean_train['Total'])
train.drop('Total', inplace=True, axis=1)
An additional example on information criteria¶
http://chuishuju.com/wiki/1442728084766752/1453027158392864#0
The linked page shows a table for vector autoregression (VAR) model selection,
listing, for different lag orders, the values of the
Akaike information criterion (AIC), the
Hannan-Quinn information criterion (HQIC), and the
Schwarz-Bayesian information criterion (SBIC).
Information criteria help select the best lag order; one usually picks the model with the smallest criterion values.
- In that particular table, the AIC and HQIC values at lags 2 and 9 are marked as significant (with an asterisk *),
- meaning the criterion values for those two lag orders are comparatively low.
- The lower the criterion value (AIC, HQIC), the better the model fits the data: higher predictive accuracy or lower information loss.
Conclusion: based on AIC and HQIC, the lag-2 and lag-9 models are likely the best among all models considered.
3.1 Vector AutoRegression¶
VAR model is a stochastic process that represents a group of time-dependent variables as a linear function of their own past values and the past values of all the other variables in the group.
Here, VAR training is computed as before, selecting the best order by minimizing the AIC. The data are standardized in the same way to remove the long-term seasonality. Not surprisingly, the best model here is a VAR(4). After testing the independence and normality of the residuals, a Hotelling's T-squared statistic can be computed to detect the early presence of anomalies:
Formula for Hotelling's T²: T²_t = e_t' Σ⁻¹ e_t, where e_t are the estimated residuals and Σ is the covariance matrix of e.
The application of the T-squared control chart is conducted in two phases: the control-limit establishment phase and the monitoring phase. The first phase focuses on obtaining the model residuals, so that the calculated control limit can be used in phase two for monitoring the residual process for future anomalies. The control limit for the T-squared control chart is given by the Upper Control Limit statistic.
### FIND BEST MULTIVARIATE MODEL ###
AIC = {}
best_aic, best_order = np.inf, 0
for i in range(1, 50):
    model = VAR(endog=train)
    var_result = model.fit(maxlags=i)
    AIC[i] = var_result.aic
    if AIC[i] < best_aic:
        best_aic = AIC[i]
        best_order = i
        best_model = var_result
print('BEST ORDER', best_order, 'BEST AIC:', best_aic)
var_result = best_model
BEST ORDER 4 BEST AIC: 35.41893067698374
residuals
array([[-0.09650648, -0.10916124, -0.67183983, -0.75718324],
[ 0.11851829, 0.18600151, -0.1638827 , -0.12575004],
[-0.08158685, 0.05461209, 0.28386889, 0.3618246 ],
...,
[-0.29287949, 0.46581561, -0.81886491, 0.20437237],
[-0.75093061, -0.98856677, -0.53039768, -1.12734485],
[-0.01184417, 0.16250366, 1.55509319, 0.7173044 ]])
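The T statistic printed below follows the formula above; a sketch with stand-in residuals (the planted spike is illustrative and shows up as the largest T value):

```python
import numpy as np

rng = np.random.default_rng(0)
residuals = rng.normal(size=(100, 4))     # stand-in for the VAR residuals e_t
residuals[10] += 5.0                      # one planted anomalous residual

sigma = np.cov(residuals.T)               # covariance matrix of the residuals
sigma_inv = np.linalg.inv(sigma)

# Hotelling's T^2 per observation: e_t' Sigma^{-1} e_t
T = np.einsum('ij,jk,ik->i', residuals, sigma_inv, residuals)
print(int(T.argmax()))                    # -> 10
```

The quadratic form weights each residual direction by the inverse covariance, so a residual vector that is large relative to the usual residual spread produces a T² spike.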
T
array([6.17327780e-01, 8.71119717e-02, 2.03436277e-01, 6.85021199e-01,
3.28130124e-01, 4.86172456e-01, 9.70086349e-01, 5.12684134e-01,
6.61672207e-01, 5.26302776e-01, 4.07553973e-01, 8.13426361e-01,
1.63195405e-01, 1.30082201e+00, 1.46388780e-01, 2.60744513e-01,
4.83511213e-01, 5.68333177e-01, 3.51320900e-01, 8.45597928e-01,
1.73474585e+00, 1.08570358e+00, 2.64263920e-01, 3.64140733e-02,
2.08173651e-02, 9.64848244e-02, 3.53472794e-01, 2.42261067e+00,
1.37400849e+00, 4.60225360e-01, 3.62350214e-01, 1.28195012e-01,
6.07797616e-02, 1.45177076e-01, 4.98331698e+00, 1.17035462e+00,
1.70451938e+00, 7.66997000e-01, 5.04177845e-01, 8.96481975e-02,
3.25207734e-01, 2.76683098e+00, 2.91993653e-02, 3.17448608e-01,
1.93897246e-01, 2.05283210e+00, 1.51370578e+00, 2.77503852e-01,
1.13268170e-01, 1.66411827e+00, 9.52586308e-02, 5.51283365e-01,
2.84220717e-01, 1.25550982e-01, 3.55417952e-01, 5.18859785e-01,
5.64099685e-01, 1.21207148e-01, 4.42319249e-01, 1.08040501e-01,
1.45716193e-01, 1.60116280e-01, 3.35199725e-01, 1.22647477e-01,
2.46377882e-01, 4.73258430e-01, 6.40278030e-01, 3.72843724e-01,
7.24905067e-01, 1.39442960e+00, 2.10120769e+00, 7.70889766e-01,
8.91235127e-01, 5.88545290e-01, 1.12771301e+00, 3.53262594e-01,
2.64337600e+00, 7.25281745e-01, 1.59810563e+00, 1.78277366e+00,
1.12885844e+00, 1.39260837e-01, 9.29015659e-01, 1.63096295e+00,
4.37431967e+00, 5.12455245e-01, 7.77480276e-01, 4.70584243e-01,
2.27794188e-02, 1.33041070e+00, 4.52018031e+00, 1.63908845e+00,
2.77426640e+00, 2.12852018e+00, 2.49349362e-01, 4.42857555e-01,
1.36787409e+00, 1.19036010e+00, 3.50034840e+00, 2.97126650e-01,
2.52304202e+00, 8.06034422e-01, 1.72664294e+00, 6.19834158e-01,
1.40291555e+00, 6.51719519e+00, 3.60185787e-01, 3.42787429e+00,
3.01908818e-01, 3.18073009e-01, 5.76229662e-01, 5.49434512e+00,
1.61281432e+01, 2.96286644e+00, 5.13450409e-01, 4.30311095e+00,
1.12528511e+00, 3.98243475e-01, 1.29482750e+00, 3.51212330e+00,
2.91413447e+00, 1.36808362e+00, 9.30801948e-01, 1.97705786e+00,
8.51407962e-01, 5.00327016e+00, 5.32314783e+00, 1.44377137e+00,
4.89142661e+00, 1.45932930e+01, 3.54160730e+00, 1.26180902e+00,
1.01141446e+00, 7.37108539e+00, 1.53236038e+00, 3.38431051e-01,
3.11288120e+00, 9.70062407e+00, 1.78325332e+00, 7.45318976e+00,
2.60740875e+00, 7.34803581e+00, 7.96588369e+00, 4.47097347e+00,
2.70758428e+00, 7.87545871e+00, 1.74734799e+00, 1.03317637e+00,
9.07909528e-01, 7.07632813e+00, 3.54333852e+00, 9.50621379e-02,
5.92715731e+00, 3.25672257e+00, 5.98398182e+00, 1.66500538e+00,
1.26785592e+00, 3.03825784e+00, 4.72261799e-01, 8.23999374e-01,
5.46101965e+00, 1.91527370e+00, 3.31271919e+00, 3.32090638e+00,
3.37662760e+00, 7.52856758e-01, 6.95614598e+00, 4.24026379e+00,
5.73395193e+00, 2.78234086e+00, 3.40951226e+00, 2.15625154e+00,
4.49311674e+00, 7.26490020e+00, 2.73870119e+00, 3.86392023e+00,
4.00022055e+00, 1.13881462e+00, 1.38221904e+00, 2.39328785e+00,
3.90129922e+00, 1.55808991e+01, 1.34917745e+01, 7.45552061e+00,
1.65276278e+00, 6.94612188e-01, 3.81219520e+00, 4.42990442e+00,
6.96617389e-02, 6.65459614e+00, 3.05034295e+00, 1.58829140e+00,
2.52252767e-01, 6.52819867e+00, 9.20727901e-01, 5.24108417e+00,
1.74491497e+01, 6.05436047e+00, 7.34703975e+00, 3.44921688e-02,
5.31194241e-01, 1.19001393e+00, 7.10072032e+00, 2.28517534e-01,
1.10904203e+00, 9.45658281e-01, 2.74257031e+00, 1.90797192e+00,
2.56820986e-01, 1.00817073e+01, 6.97251145e+00, 1.12562541e+01,
1.10078485e+01, 1.79683013e+01, 4.58504276e+00, 7.81186722e-01,
3.86051458e+00, 9.84636014e+00, 2.05677188e+00, 2.86786551e+00,
1.31055748e+00, 1.64715255e+01, 2.36132754e+01, 2.01320841e+02,
1.93928943e+01, 9.26732367e+01, 7.96797732e+00, 1.66292060e+01,
6.42807082e+00, 2.90795586e+00, 4.35660098e+00, 5.20284942e+01,
4.25945573e+00, 8.79494139e+00, 3.31735570e+00, 2.09528776e+00,
2.93338362e+00, 2.10272333e+00, 8.09953046e-01, 1.99737411e+00,
3.55842630e+00, 5.32341914e+00, 4.09281232e+00, 2.15612420e+00,
3.37814570e+00, 5.53026984e+00, 4.55497721e+00, 1.41321587e+00,
1.63571906e+00, 3.49993104e+00, 2.56876272e+00, 5.42181971e+00,
1.97515575e+00, 4.32497840e+00, 3.48382202e+00, 2.68681799e-01,
1.92713179e+00, 8.14786803e-01, 8.56591575e+00, 8.56890760e-01,
1.52989618e+00, 3.63720872e+00, 1.14747838e+00, 7.61915867e-01,
1.69772199e-01, 9.20879020e-01, 1.78302545e+00, 2.26362077e-01,
9.92168658e-01, 1.00591148e+00, 7.10445853e-01, 8.76927891e+01,
...
1.73517426e+01, 5.92488576e+00, 2.80859110e+00, 5.61116613e+00])
### COMPUTE TRAIN T2 METRIC ###
# Mean of the residuals
residuals_mean = var_result.resid.values.mean(axis=0)
# Standard deviation of the residuals
residuals_std = var_result.resid.values.std(axis=0)
# ----------------------------------------------------------------------
# Steps to compute the T-squared statistic:
#
# 1. Standardize the residuals to zero mean and unit variance
residuals = (var_result.resid.values - residuals_mean) / residuals_std
# 2. Invert the residual covariance matrix, needed for the T-squared statistic
cov_residuals = np.linalg.inv(np.cov(residuals.T))
# 3. Compute the T-squared statistic for each observation
T = np.diag(residuals.dot(cov_residuals).dot(residuals.T))
# print('T2:', T)
### COMPUTE UCL ###
# Compute the UCL, the upper limit for the T-squared statistic used to flag
# anomalies; it is derived from the F distribution.
m = var_result.nobs
p = var_result.resid.shape[-1]
alpha = 0.01
UCL = stats.f.ppf(1-alpha, dfn=p, dfd=m-p) * (p*(m+1)*(m-1)/(m*m-m*p))
print('UCL:', UCL)
👉🏻 Here F denotes an F distribution with p and m−p degrees of freedom at significance level alpha. If T² > UCL, stop and investigate. The Sigma estimated at the end of phase 1 (together with the residuals' mean and standard deviation) is used to compute the T-squared statistic for each new observation.
If the T² statistic exceeds the upper control limit (UCL), the observation is flagged as an anomaly.
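A minimal, self-contained sketch of this T² > UCL decision rule (synthetic standardized residuals; m, p, and alpha here are illustrative values, not the notebook's):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
m, p, alpha = 500, 4, 0.01   # illustrative sample size, dimension, significance level

# Synthetic standardized residuals from an in-control process
residuals = rng.standard_normal((m, p))
cov_inv = np.linalg.inv(np.cov(residuals.T))

# Hotelling's T-squared for each observation
T2 = np.diag(residuals @ cov_inv @ residuals.T)

# UCL from the F distribution, as in the formula above
UCL = stats.f.ppf(1 - alpha, dfn=p, dfd=m - p) * (p * (m + 1) * (m - 1) / (m * m - m * p))

flags = T2 > UCL             # True marks an observation to stop and investigate
print(UCL, flags.sum())
```

By construction, roughly alpha·m of the in-control observations fall above the UCL, which is the expected false-alarm rate of the chart.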
'''
len(T) = 556
train.index[best_order:]
DatetimeIndex(['2013-12-22', '2013-12-23', '2013-12-24', '2013-12-25',
'2013-12-26', '2013-12-27', '2013-12-28', '2013-12-29',
'2013-12-30', '2013-12-31',
...
'2015-06-21', '2015-06-22', '2015-06-23', '2015-06-24',
'2015-06-25', '2015-06-26', '2015-06-27', '2015-06-28',
'2015-06-29', '2015-06-30'],
dtype='datetime64[ns]', length=556, freq='D')
'''
### PLOT TRAIN T2 METRIC ###
plt.figure(figsize=(16,6), dpi= 1000)
plt.plot(train.index[best_order:], T)
plt.ylabel('T-squared')
plt.axhline(UCL, c='red', linestyle='--')
<matplotlib.lines.Line2D at 0x1d563778c10>
A quick explanation of AIC¶
Suppose you have two VAR models, one with 3 parameters (VAR1) and another with 5 (VAR2).
If the two models' maximum likelihood values are similar, VAR1, which uses fewer parameters (and therefore has the lower AIC), is usually considered the better model: it achieves a comparable fit with fewer parameters.
The larger the likelihood, the better the model fits the observed data.
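To make this concrete, AIC = 2k − 2 ln L̂, where k is the number of parameters and L̂ the maximized likelihood. A tiny sketch with made-up log-likelihoods (not values from this notebook):

```python
import numpy as np

def aic(num_params: int, log_likelihood: float) -> float:
    """AIC = 2k - 2*ln(L): lower is better."""
    return 2 * num_params - 2 * log_likelihood

# Two hypothetical VAR fits with nearly identical log-likelihoods
aic_var1 = aic(3, -120.0)   # 3 parameters
aic_var2 = aic(5, -119.5)   # 5 parameters, only a marginally better fit

print(aic_var1, aic_var2)   # 246.0 249.0 -> VAR1 wins despite the slightly worse fit
```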
A quick explanation of maximum likelihood estimation¶
Maximum likelihood estimation is a method for estimating model parameters from observed data. The likelihood function is built on the principle that the chosen parameters should maximize the probability of the data we actually observed; in other words, under those parameters the model best explains the observations.
T² and UCL values¶
T² value: in multivariate statistical process control, T² is Hotelling's T² statistic, used to monitor whether a sample differs significantly from a control group or baseline data. It is the multivariate generalization of the t statistic.
UCL (Upper Control Limit): in statistical quality control, the UCL is a key parameter on a control chart that bounds the upper end of a process variable's normal operating range. If the process data exceed this limit, the process may be abnormal and warrants further investigation or intervention.
3.2 Analysis 📝¶
The VAR model extends the univariate autoregressive (AR) model by capturing the linear relations between multiple variables. A regression is carried out for each input series: the original variables are regressed on their own lagged values and on the lagged values of the other variables. For our multivariate task, we consider both the bike and the pedestrian series.
In a multivariate process with serial correlation, we use a VAR model to approximate the system and monitor its residuals as a serially independent series. Approximating a linear system with a VAR is appropriate given the physical principles of the process dynamics.
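Conceptually, fitting a VAR(1) is just a multivariate least-squares regression of each series on the lagged values of all series. A numpy-only sketch on synthetic two-series data (illustrative, not the notebook's statsmodels fit):

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulate a 2-variable VAR(1): x_t = A_true @ x_{t-1} + noise
A_true = np.array([[0.6, 0.2],
                   [0.1, 0.5]])
n, k = 1000, 2
x = np.zeros((n, k))
for t in range(1, n):
    x[t] = A_true @ x[t - 1] + 0.1 * rng.standard_normal(k)

# Regress each series on the lagged values of all series (OLS),
# which is exactly what a VAR(1) fit does
Y, X = x[1:], x[:-1]
A_hat = np.linalg.lstsq(X, Y, rcond=None)[0].T

# These residuals play the role of var_result.resid in the notebook
residuals = Y - X @ A_hat.T
print(np.round(A_hat, 2))
```

The recovered coefficient matrix is close to A_true, and the residuals are (approximately) serially independent, which is what makes them suitable for T² monitoring.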
### TEST SPLIT IN MULTIVARIATE CASE ###
test = df_day[(train_hours//24-best_order):].copy()
test.drop('Total', inplace=True, axis=1)
test.shape
(109, 4)
### REMOVE LONG TERM SEASONALITY FROM TEST ###
for col in ['Ped East', 'Ped West', 'Bike East', 'Bike West']:
test[col] = test.apply(lambda x: x[col] - month_mean_train[col][x.name.month], axis=1)
### ITERATIVE PREDICTIONS ON TEST DATA ###
pred = []
'''
test:
Ped East Ped West Bike East Bike West
2015-06-27 -186.966667 -147.733333 -245.366667 -384.066667
2015-06-28 -178.966667 -81.733333 -579.366667 -355.066667
2015-07-01 -269.193548 -167.612903 -627.387097 -543.290323
... ... ... ... ...
2015-10-12 -75.193548 -62.645161 -274.129032 -255.548387
2015-10-13 -87.193548 -93.645161 -321.129032 -273.548387
109 rows × 4 columns
'''
for i in range(best_order, len(test)):
pred.append(var_result.forecast(test.iloc[i-best_order:i].values, steps=1))
pred = np.vstack(pred)
print(pred.shape)
### COMPUTE TEST T2 METRIC ###
residuals_test = test.iloc[best_order:].values - pred
residuals_test = (residuals_test - residuals_mean) / residuals_std
T_test = np.diag((residuals_test).dot(cov_residuals).dot(residuals_test.T))
(105, 4)
✏️ Here, I collected the VAR residuals and used them to build a threshold alert system, which flags an alarm in case of anomalous behaviors.
'''
from training
UCL = 13.510289428754227
'''
# First find which T-squared values exceed the UCL
over_ucl_mask = T_test > UCL
# Get the timestamps corresponding to these anomalies
abnormal_timestamps = test.iloc[best_order:].index[over_ucl_mask]
# Print the timestamps of the anomalies
print("Timestamps of anomalies:")
print(abnormal_timestamps)
Timestamps of anomalies:
DatetimeIndex(['2015-07-30', '2015-07-31', '2015-08-01', '2015-08-02',
'2015-08-03', '2015-08-04', '2015-08-05', '2015-08-06',
'2015-08-09', '2015-09-18'],
dtype='datetime64[ns]', freq=None)
plt.figure(figsize=(16,6), dpi=1000)
plt.plot(test.iloc[best_order:].index, T_test, label='T-squared values')
plt.scatter(abnormal_timestamps, T_test[over_ucl_mask], color='red', label='Outliers') # mark the outliers
plt.axhline(UCL, color='red', linestyle='--', label='UCL')
plt.ylabel('T-squared')
plt.title('Test T-squared Metric with Outliers')
plt.legend()
plt.show()
Real-time application
### COMPUTE TRAIN T2 METRIC ###
# Mean of the residuals
residuals_mean = var_result.resid.values.mean(axis=0)
# Standard deviation of the residuals
residuals_std = var_result.resid.values.std(axis=0)
# Standardize the residuals to zero mean and unit variance
residuals = (var_result.resid.values - residuals_mean) / residuals_std
# Inverse covariance matrix, needed for the T-squared statistic
cov_residuals = np.linalg.inv(np.cov(residuals.T))
T = np.diag(residuals.dot(cov_residuals).dot(residuals.T))
abnormal_timestamps = []
def remove_seasonality(pred_data):
for column in ['Ped East', 'Ped West', 'Bike East', 'Bike West']:
pred_data[column] = pred_data.apply(lambda x: x[column] - month_mean_train[column][x.name.month], axis=1)
return pred_data
def predict_and_detect(pred_data):
'''
pred_data:
Ped East Ped West Bike East Bike West
2015-06-27 -82.983333 -27.366667 382.816667 62.466667
2015-06-28 -74.983333 38.633333 48.816667 91.466667
2015-06-29 -95.983333 -56.366667 -95.183333 -167.533333
2015-06-30 -68.983333 -25.366667 129.816667 -1.533333
2015-07-01 -126.096774 -35.806452 68.806452 -95.645161
2015-07-02 -114.096774 -84.806452 -41.193548 -129.645161
2015-07-03 -46.096774 -10.806452 59.806452 -57.645161
2015-07-04 -107.096774 -18.806452 87.806452 -43.645161
2015-07-05 -116.096774 -65.806452 228.806452 -32.645161
2015-07-06 -117.096774 -28.806452 -123.193548 -136.645161
'''
# Each call is assumed to receive a DataFrame of 10 monitoring rows (10 x 4)
new_pred = []
# Make sure pred_data has at least best_order rows of data
if len(pred_data) >= best_order:
for i in range(best_order, len(pred_data)):
new_pred.append(var_result.forecast(pred_data.iloc[i-best_order:i].values, steps=1))
# Skip np.vstack if new_pred is empty
if new_pred:
new_pred = np.vstack(new_pred)
else:
print("No predictions were made due to insufficient data.")
return None, pred_data.iloc[best_order:].index # or a suitable default value
else:
# print("Not enough data to make predictions.")
return None, pred_data.iloc[best_order:].index # or a suitable default value
'''
new_pred:
[[-29.7338733 26.23259132 89.36743286 41.52800336]
[-70.8853915 -25.6406824 5.66445042 -47.98118328]
[-76.01996643 -37.38817594 24.45153718 -19.39430962]
[ -0.30872747 27.63148597 9.98642279 -21.52362993]
[-74.73226428 -2.86779294 40.23452029 -8.11772349]
[-57.04008105 -31.02994369 111.39993613 22.53573073]]
'''
# Compute the residual for each monitored value
residuals_test = pred_data.iloc[best_order:].values - new_pred
residuals_test = (residuals_test - residuals_mean) / residuals_std
# Compute the T-squared values
if len(residuals_test.shape) == 1:
residuals_test = residuals_test.reshape(1, -1)
T_test_value = np.diag(residuals_test.dot(cov_residuals).dot(residuals_test.T))
# T-squared values: [ 7.73665024 1.20084192 2.0554539 4.89641667 12.1021739 3.15436206]
# Detect anomalies and record their timestamps
abnormal_timestamps = pred_data.iloc[best_order:].index[T_test_value > UCL]
print("Timestamps of anomalies: ", abnormal_timestamps)
return T_test_value, abnormal_timestamps
def plot_t_squared_with_outliers(pred_data, T_test_value, UCL, abnormal_timestamps):
# Add the line of T-squared values; red if this window contains outliers, blue otherwise
line_color = 'red' if not abnormal_timestamps.empty else 'blue'
fig.add_trace(go.Scatter(
x=pred_data.iloc[best_order:].index,
y=T_test_value,
mode='lines',
name='T-squared values',
line=dict(color=line_color) # red when outliers are present, blue otherwise
))
# Mark the outliers
# Extract the T-squared values corresponding to the anomalies
outlier_values = [T_test_value[i] for i, date in enumerate(pred_data.iloc[best_order:].index) if date in abnormal_timestamps]
fig.add_trace(go.Scatter(
x=list(abnormal_timestamps),
y=outlier_values,
mode='markers',
marker=dict(color='red', size=10),
name='Outliers'
))
# Add the UCL horizontal line
fig.add_hline(y=UCL, line=dict(color='green', dash='dash'), annotation_text="UCL", annotation_position="bottom right")
# Set the chart title and axis labels
fig.update_layout(
title='Test T-squared Metric with Outliers',
xaxis_title='Date',
yaxis_title='T-squared',
legend_title='Legend',
legend=dict(
yanchor="top",
y=0.99,
xanchor="left",
x=0.01
)
)
fig = go.Figure()
for i in range(0, 500, 10):
upper_bound = i + best_order + 10
if upper_bound > len(test):
upper_bound = len(test)
pred_data = test.iloc[i:upper_bound]
# Remove seasonality
# pred_data = remove_seasonality(pred_data)
# Predict and detect anomalies
T_test_value, abnormal_timestamps = predict_and_detect(pred_data)
# Plot the T-squared values and outliers (skip windows with no predictions)
if T_test_value is not None:
plot_t_squared_with_outliers(pred_data, T_test_value, UCL, abnormal_timestamps)
# Display the figure
fig.show()
'''
Timestamps of anomalies:
DatetimeIndex(['2015-07-30', '2015-07-31', '2015-08-01', '2015-08-02',
'2015-08-03', '2015-08-04', '2015-08-05', '2015-08-06',
'2015-08-09', '2015-09-18'],
dtype='datetime64[ns]', freq=None)
'''
Timestamps of anomalies:  DatetimeIndex([], dtype='datetime64[ns]', freq='D')
Timestamps of anomalies:  DatetimeIndex([], dtype='datetime64[ns]', freq='D')
Timestamps of anomalies:  DatetimeIndex(['2015-07-30'], dtype='datetime64[ns]', freq='D')
Timestamps of anomalies:  DatetimeIndex(['2015-07-31', '2015-08-01', '2015-08-02', '2015-08-03',
'2015-08-04', '2015-08-05', '2015-08-06', '2015-08-09'],
dtype='datetime64[ns]', freq=None)
Timestamps of anomalies:  DatetimeIndex([], dtype='datetime64[ns]', freq='D')
Timestamps of anomalies:  DatetimeIndex([], dtype='datetime64[ns]', freq='D')
Timestamps of anomalies:  DatetimeIndex([], dtype='datetime64[ns]', freq='D')
Timestamps of anomalies:  DatetimeIndex(['2015-09-18'], dtype='datetime64[ns]', freq='D')
Timestamps of anomalies:  DatetimeIndex([], dtype='datetime64[ns]', freq='D')
Timestamps of anomalies:  DatetimeIndex([], dtype='datetime64[ns]', freq='D')
Timestamps of anomalies:  DatetimeIndex([], dtype='datetime64[ns]', freq='D')
Data prediction: the var_result.forecast() method predicts future values for each monitoring window. Since var_result is a pre-trained vector autoregression (VAR) model, it uses the recent history (pred_data.iloc[i-best_order:i].values) to forecast the next value (steps=1). VAR models handle multiple interrelated time series simultaneously.
Calculate residuals: residuals are the differences between the observed values and the model's predictions; pred_data.iloc[best_order:].values - new_pred computes them. Residuals are fundamental to anomaly detection because they measure the discrepancy between prediction and reality.
Normalize residuals: normalization removes scale effects and makes the residuals comparable across variables, via (residuals_test - residuals_mean) / residuals_std, where residuals_mean and residuals_std are the training residuals' mean and standard deviation. This prepares the residuals for further statistical analysis, which is especially important in multivariate settings.
Calculate T² values: Hotelling's T² statistic is a multivariate tool for detecting significant differences between a multivariate sample and a population. np.diag(residuals_test.dot(cov_residuals).dot(residuals_test.T)) computes the T² value for each sample, where cov_residuals is the inverse covariance matrix of the training residuals, capturing the correlation between the residuals of different variables. High T² values indicate large deviations from the model's predictions and are potential anomalies.
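One practical note: `np.diag(R.dot(C).dot(R.T))` materializes the full n×n product just to keep its diagonal. The same per-row quadratic form can be computed without the n×n intermediate, e.g. with `einsum` (sketch on synthetic residuals, not the notebook's data):

```python
import numpy as np

rng = np.random.default_rng(2)
residuals_test = rng.standard_normal((200, 4))      # synthetic standardized residuals
cov_residuals = np.linalg.inv(np.cov(residuals_test.T))

# Form used in the notebook: builds a 200x200 matrix only to take its diagonal
T_diag = np.diag(residuals_test @ cov_residuals @ residuals_test.T)

# Equivalent per-row quadratic form, O(n*p^2) instead of O(n^2*p) memory/work
T_fast = np.einsum('ij,jk,ik->i', residuals_test, cov_residuals, residuals_test)

print(np.allclose(T_diag, T_fast))
```

For a few hundred daily observations the difference is negligible, but in the real-time setting the `einsum` form keeps memory flat as the window count grows.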
Detect anomalies and record timestamps: each sample's T² value is compared with the predefined Upper Control Limit (UCL). A T² value above the UCL means the data point differs significantly from the model's prediction and is flagged as anomalous; T_test_value > UCL identifies these points, and pred_data.iloc[best_order:].index[T_test_value > UCL] records their timestamps.